Abstract

This paper surveys the emerging science of how to design a "COllective INtelligence"
(COIN). A COIN is a large multi-agent system where:
i) There is little to no centralized communication or control.
ii) There is a provided world utility function that rates the possible histories of the
full system.

In particular, we are interested in COINs in which each agent runs a reinforcement
learning (RL) algorithm. The conventional approach to designing large distributed
systems to optimize a world utility does not use agents running RL algorithms.
Rather, that approach begins with explicit modeling of the dynamics of the overall
system, followed by detailed hand-tuning of the interactions between the components
to ensure that they "cooperate" as far as the world utility is concerned. This
approach is labor-intensive, often results in highly nonrobust systems, and usually
results in design techniques that have limited applicability.

In contrast, we wish to solve the COIN design problem implicitly, via the "adaptive"
character of the RL algorithms of each of the agents. This approach introduces an
entirely new, profound design problem: Assuming the RL algorithms are able to 
achieve high rewards, what reward functions for the individual agents will, when 
pursued by those agents, result in high world utility? In other words, what reward 
functions will best ensure that we do not have phenomena like the tragedy of the 
commons, Braess's paradox, or the liquidity trap? 

Although still very young, research specifically concentrating on the COIN design 
problem has already resulted in successes in artificial domains, in particular in 
packet-routing, the leader-follower problem, and in variants of Arthur's El Farol 
bar problem. It is expected that as it matures and draws upon other disciplines re- 
lated to COINs, this research will greatly expand the range of tasks addressable by 
human engineers. Moreover, in addition to drawing on them, such a fully developed 
science of COIN design may provide much insight into other already established 
scientific fields, such as economics, game theory, and population biology. 
1 Introduction



Over the past decade or so two separate developments have occurred in computer science 
whose intersection promises to open a vast new area of research, an area extending far 
beyond the current boundaries of computer science. The first of these developments is 
the growing realization of how useful it would be to be able to control distributed systems 
that have little (if any) centralized communication, and to do so "adaptively", with minimal
reliance on detailed knowledge of the system's small-scale dynamical behavior. The
second development is the maturing of the discipline of reinforcement learning (RL). 
This is the branch of machine learning that is concerned with an agent who periodi- 
cally receives "reward" signals from the environment that partially reflect the value of 
that agent's private utility function. The goal of an RL algorithm is to determine how, 
using those reward signals, the agent should update its action policy to maximize its
utility [142, 255, 272]. (Until our detailed discussions below, we will use the term
"reinforcement learning" broadly, to include any algorithm of this sort, including ones
that rely on detailed Bayesian modeling of underlying Markov processes [225, 48, 98].)

Intuitively, one might hope that RL would help us solve the distributed control prob- 
lem, since RL is adaptive, and, in particular, since it is not restricted to domains having 
sufficient breadths of communication. However, by itself, conventional single-agent RL 
does not provide a means for controlling large, distributed systems. This is true even 
if the system does have centralized communication. The problem is that the space of 
possible action policies for such systems is too big to be searched. As a variant, we might
imagine using a large set of agents, each controlling only part of the system. Since the
individual action spaces of such agents would be relatively small, we could realistically 
deploy conventional RL on each one. However, now we face the central question of how 
to map the world utility function concerning the overall system into private utility func- 
tions for each of the agents. In particular, how should we design those private utility 
functions so that each agent can realistically hope to optimize its function, and at the 
same time the collective behavior of the agents will optimize the world utility? 
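
To make the size argument above concrete, the following minimal sketch compares the number of joint actions a single centralized learner would face with the total number of actions faced by a set of agents that each control one component. The function names and the choice of four actions per component are illustrative assumptions, not quantities from this chapter, and the joint action count is used only as a rough proxy for the size of the action-policy space.

```python
def joint_action_count(num_agents: int, actions_per_agent: int) -> int:
    """Joint actions a single centralized learner must distinguish (grows exponentially)."""
    return actions_per_agent ** num_agents


def per_agent_action_total(num_agents: int, actions_per_agent: int) -> int:
    """Total actions summed over agents that each learn only their own component."""
    return num_agents * actions_per_agent


if __name__ == "__main__":
    # Hypothetical setting: every component has 4 possible actions.
    for n in (2, 10, 50):
        print(n, joint_action_count(n, 4), per_agent_action_total(n, 4))
```

The exponential growth in the first column is what makes a solitary learner impractical; the linear growth in the second is what makes per-agent RL feasible, at the price of having to choose private utilities well.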

We use the term "Collective INtelligence" (COIN) to refer to any pair of a large, 
distributed collection of interacting computational processes among which there is little 
to no centralized communication or control, together with a 'world utility' function 
that rates the possible dynamic histories of the collection. The central COIN design 
problem we consider arises when the computational processes run RL algorithms: How, 
without any detailed modeling of the overall system, can one set the utility functions 
for the RL algorithms in a COIN to have the overall dynamics reliably and robustly 
achieve large values of the provided world utility? The benefits of an answer to this 
question would extend beyond the many branches of computer science, having major 
ramifications for many other sciences as well. Section 2 discusses some of those benefits.
Section 3 reviews previous work that has bearing on the COIN design problem. The
subsequent section constitutes the core of this chapter. It presents a quick outline of a
promising mathematical framework for addressing this problem in its most general form,
and then experimental illustrations of the prescriptions of that framework. Throughout, we will
use italics for emphasis, single quotes for informally defined terms, and double quotes to 
delineate colloquial terminology. 

2 Background 

There are many design problems that involve distributed computational systems where 
there are strong restrictions on centralized communication ("we can't all talk"); or there 
is communication with a central processor, but that processor is not sufficiently powerful 
to determine how to control the entire system ("we aren't smart enough"); or the pro- 
cessor is powerful enough in principle, but it is not clear what algorithm it could run by 
itself that would effectively control the entire system ("we don't know what to think"). 
Just a few of the potential examples include: 

i) Designing a control system for constellations of communication satellites or for 
constellations of planetary exploration vehicles (world utility in the latter case being 
some measure of quality of scientific data collected); 

ii) Designing a control system for routing over a communication network (world utility
being some aggregate quality of service measure);

iii) Construction of parallel algorithms for solving numerical optimization problems 
(the optimization problem itself constituting the world utility); 

iv) Vehicular traffic control, e.g., air traffic control, or high-occupancy toll-lanes for 
automobiles. (In these problems the individual agents are humans and the associated 
utility functions must be of a constrained form, reflecting the relatively inflexible kinds 
of preferences humans possess.); 

v) Routing over a power grid; 

vi) Control of a large, distributed chemical plant; 

vii) Control of the elements of an amorphous computer; 

viii) Control of the elements of a 'noisy' phased array radar; 

ix) Compute-serving over an information grid. 

Such systems may be best controlled with an artificial COIN. However, the potential 
usefulness of deeper understanding of how to tackle the COIN design problem extends 
far beyond such engineering concerns. That's because the COIN design problem is an 
inverse problem, whereas essentially all of the scientific fields that are concerned with 
naturally-occurring distributed systems analyze them purely as a "forward problem." 
That is, those fields analyze what global behavior would arise from provided local dy- 
namical laws, rather than grapple with the inverse problem of how to configure those 
laws to induce desired global behavior. (Indeed, the COIN design problem could almost 
be defined as decentralized adaptive control theory for massively distributed stochastic 
environments.) It seems plausible that the insights garnered from understanding the
inverse problem would provide a trenchant novel perspective on those fields. Just as
tackling the inverse problem in the design of steam engines led to the first true
understanding of the macroscopic properties of physical bodies (aka thermodynamics), so
may the cracking of the COIN design problem improve our understanding of many
naturally-occurring COINs. In addition, although those other fields do not focus on the
COIN design problem, they are all related to it, so that problem may be able to serve
as a "touchstone" for all of them. This may then
reveal novel connections between the fields. 

As an example of how understanding the COIN design problem may provide a novel 
perspective on other fields, consider countries with capitalist human economies. Al- 
though there is no intrinsic world utility in such systems, they can still be viewed from 
the perspective of COINs, as naturally occurring COINs. For example, one can declare 
world utility to be a time average of the Gross Domestic Product (GDP) of the country 
in question. (World utility per se is not a construction internal to a human economy, 
but rather something defined from the outside.) The reward functions for the human 
agents in this example could then be the achievements of their personal goals (usually 
involving personal wealth to some degree). 

Now in general, to achieve high world utility in a COIN it is necessary to avoid having 
the agents work at cross-purposes. Otherwise the system is vulnerable to economic 
phenomena like the Tragedy of the Commons (TOC), in which individual avarice works
to lower world utility [115], or the liquidity trap, where behavior that helps the entire
system when employed by some agents results in poor global behavior when employed by
all agents [158]. One way to avoid such phenomena is by modifying the agents' utility
functions. In the context of capitalist economies, this kind of effect can be achieved via 
punitive legislation that modifies the rewards the agents receive for engaging in certain 
kinds of activity. A real world example of an attempt to make just such a modification 
was the creation of anti-trust regulations designed to prevent monopolistic practices. 
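
As a hedged illustration of agents working at cross-purposes, the sketch below uses a textbook-style commons model that is our own assumption rather than anything specified in this chapter: each of N agents chooses a usage level x, the value per unit of use is a - b * (total use), and world utility is the total value extracted. Each agent's selfish best response leads to a symmetric equilibrium whose world utility falls well below the optimum as N grows. All names and parameter values are hypothetical.

```python
def best_response(others_total: float, a: float = 1.0, b: float = 0.01) -> float:
    """Usage x maximizing one agent's payoff x * (a - b * (x + others_total))."""
    return max(0.0, (a - b * others_total) / (2.0 * b))


def world_utility(total_use: float, a: float = 1.0, b: float = 0.01) -> float:
    """Total value extracted from the commons at a given total usage."""
    return total_use * (a - b * total_use)


def nash_total_use(num_agents: int, a: float = 1.0, b: float = 0.01) -> float:
    """Total usage at the symmetric equilibrium, where x = a / (b * (N + 1)) per agent."""
    return num_agents * a / (b * (num_agents + 1))


def optimal_total_use(a: float = 1.0, b: float = 0.01) -> float:
    """Total usage that maximizes world utility."""
    return a / (2.0 * b)


if __name__ == "__main__":
    for n in (1, 2, 9, 99):
        print(n, world_utility(nash_total_use(n)), world_utility(optimal_total_use()))
```

With a single agent the selfish and global optima coincide; the gap opens only when many agents share the resource, which is the situation that modified utility functions are meant to repair.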

In designing a COIN we usually have more freedom than anti-trust regulators though, 
in that there is no base-line "organic" private utility function over which we must su- 
perimpose legislation-like incentives. Rather, the entire "psychology" of the individual 
agents is at our disposal when designing a COIN. This obviates the need for honesty- 
elicitation ('incentive compatible') mechanisms, like auctions, which form a central com- 
ponent of conventional economics. Accordingly, COINs can differ in certain crucial re- 
spects from human economies. The precise differences — the subject of current research 
— seem likely to present many insights into the functioning of economic structures like 
anti-trust regulators. 

To continue with this example, consider the usefulness, as far as the world utility 
is concerned, of having (commodity, or especially fiat) money in the COIN. Formally, 
from a COIN perspective, the use of 'money' for trading between agents constitutes 
a particular class of couplings between the states and utility functions of the various
agents. For example, if one agent's 'bank account' variable goes up in a 'trade' with
another agent, then a corresponding 'bank account' variable in that other agent must
decrease to compensate. In addition to this coupling between the agents' states, there
is also a coupling between their utilities, if one assumes that both agents will prefer to
have more money rather than less, everything else being equal. However one might
formally define such a 'money' structure, we can consider what happens if it does (or 
does not) obtain for an arbitrary dynamical system, in the context of an arbitrary world 
utility. For some such dynamical systems and world utilities, a money structure will 
improve the value of that world utility. But for the same dynamics, the use of a money 
structure will simultaneously induce low levels of other world utilities (a trivial example 
being a world utility that equals the negative of the first one). This raises a host of 
questions, like how to formally specify the most general set of world utilities that benefits 
significantly from using money-based private utility functions. If one is provided a world 
utility that is not a member of that set, then an "economics-like" configuration of the 
system is likely to result in poor performance. Such a characterization of how and when 
money helps improve world utilities of various sorts might have important implications 
for conventional human economics, especially when one chooses world utility to be one 
of the more popular choices for social welfare function. (See [251] and references
therein for some of the standard economics work that is most relevant to this issue.)

There are many other scientific fields that are currently under investigation from a 
COIN-design perspective. Some of them are, like economics, part of (or at least closely 
related to) the social sciences. These fields typically involve RL algorithms under the 
guise of human agents. An example of such a field is game theory, especially game 
theory of bounded rational players. As illustrated in our money example, viewing such 
systems from the perspective of a non-endogenous world utility, i.e., from a COIN-design 
perspective, holds the potential for providing novel insight into them. (In the case of 
game theory, it holds the potential for leading to deeper understanding of many-player 
inverse stochastic game theory.) 

However there are other scientific fields that might benefit from a COIN-design per- 
spective even though they study systems that don't even involve RL algorithms. The idea 
here is that if we viewed such systems from an "artificial" teleological perspective, both 
in concentrating on a non-endogenous world utility and in casting the nodal elements of 
the system as RL algorithms, we could learn a lot about the form of the 'design space' 
in which such systems live. (Just as in economics, where the individual nodal elements 
are RL algorithms, investigating the system using an externally imposed world utility 
might lead to insight.) Examples here are ecosystems (individual genes, individuals, or 
species being the nodal elements) and cells (individual organelles in Eukaryotes being 
the nodal elements). In both cases, the world utility could involve robustness of the 
desired equilibrium against external perturbation, efficient exploitation of free energy in 
the environment, etc. 
3 Review of Literature Related to COINs



The following list elaborates what we mean by a COIN: 

1) There are many processors running concurrently, performing actions that affect 
one another's behavior. 

2) There is little to no centralized personalized communication, i.e., little to no be- 
havior in which a small subset of the processors not only communicates with all the 
other processors, but communicates differently with each one of those other processors. 
Any single processor's "broadcasting" the same information to all other processors is not 
precluded. 

3) There is little to no centralized personalized control, i.e., little to no behavior in 
which a small subset of the processors not only controls all the other processors, but 
controls each one of those other processors differently. "Broadcasting" the same control 
signal to all other processors is not precluded. 

4) There is a well-specified task, typically in the form of extremizing a utility function,
that concerns the behavior of the entire distributed system. So we are confronted with 
the inverse problem of how to configure the system to achieve the task. 

The following elements characterize the sorts of approaches to COIN design we are 
concerned with here: 

5) The approach for tackling (4) is scalable to very large numbers of processors. 

6) The approach for tackling (4) is very broadly applicable. In particular, it can work 
when little (if any) "broadcasting" as in (2) and (3) is possible. 

7) The approach for tackling (4) involves little to no hand-tailoring. 

8) The approach for tackling (4) is robust and adaptive, with minimal need to "get 
the details exactly right or else," as far as the stochastic dynamics of the system is 
concerned. 

9) The individual processors are running RL algorithms. Unlike the other elements of 
this list, this one is not an a priori engineering necessity. Rather, it is a reflection of the 
fact that RL algorithms are currently the best-understood and most mature technology 
for addressing the points (7) and (8).

There are many approaches to COIN design that do not have every one of those 
features. These approaches constitute part of the overall field of COIN design. As 
discussed below though, not having every feature in our list, no single one of those 
approaches can be extended to cover the entire breadth of the field of COIN design. 
(This is not too surprising, since those approaches are parts of fields whose focus is not 
the COIN design problem per se.) 

The rest of this section consists of brief presentations of some of these approaches, and 
in particular characterizes them in terms of our list of nine characteristics of COINs and 
of our desiderata for their design. Of the approaches we discuss, at present it is probably
the ones in Artificial Intelligence and Machine Learning that are most directly applicable
to COIN design. However it is fairly clear how to exploit those approaches for COIN 
design, and in that sense relatively little needs to be said about them. In contrast, as 
currently employed, the toolsets in the social sciences are not as immediately applicable 
to COIN design. However, it seems likely that there is more yet to be discovered about 
how to exploit them for COIN design. Accordingly, we devote more space to those social 
science-based approaches here. 

Later in this chapter we present an approach that holds promise for covering all nine of
our desired features.



3.1 AI and Machine Learning 

There is an extensive body of work in AI and machine learning that is related to COIN 
design. Indeed, one of the most famous speculative works in the field can be viewed as
an argument that AI should be approached as a COIN design problem [185]. Much
work of a more concrete nature is also closely related to the problem of COIN design.



3.1.1 Reinforcement Learning 



As discussed in the introduction, the maturing field of reinforcement learning provides a 
much needed tool for the types of problems addressed by COINs. Because RL generally 
provides model-free^1 and "online" learning features, it is ideally suited for the distributed
environment where a "teacher" is not available and the agents need to learn successful
strategies based on "rewards" and "penalties" they receive from the overall system at
various intervals. It is even possible for the learners to use those rewards to modify how
they learn [234, 235].



Although work on RL dates back to Samuel's checker player [226], relatively recent
theoretical [272] and empirical results [60, 263] have made RL one of the most active
areas in machine learning. Many problems ranging from controlling a robot's gait to
controlling a chemical plant to allocating constrained resources have been addressed with
considerable success using RL PI, p|, |2T5|, pi. In particular, the RL algorithms
TD(λ) (which rates potential states based on a value function) [255] and Q-learning
(which rates action-state pairs) [272] have been investigated extensively. A detailed
investigation of RL is available in [142, ...].
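
As a rough illustration of what "rating action-state pairs" means, the sketch below implements the standard tabular Q-learning update, in which the rating of a state-action pair is moved toward the received reward plus the discounted rating of the best action in the next state. The function name, the state and action labels, and the learning-rate and discount values are illustrative assumptions; nothing here is specific to the systems surveyed in this chapter.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new rating.

    q       -- mapping from (state, action) pairs to their current ratings
    actions -- the actions available in next_state
    alpha   -- learning rate; gamma -- discount factor
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


if __name__ == "__main__":
    # Hypothetical two-state, two-action example starting from all-zero ratings.
    q = defaultdict(float)
    print(q_update(q, "s0", "left", 1.0, "s1", ["left", "right"], alpha=0.5))
```

Because the update uses only the observed reward and the current table, it needs no model of the environment, which is the "model-free" property emphasized above.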



Although powerful and widely applicable, solitary RL algorithms will not perform
well on large distributed heterogeneous problems in general. This is due to the very
large size of the action-policy space for such problems. In addition, without centralized
communication and control, how a solitary RL algorithm could run the full system at
all, poorly or well, becomes a major concern.^2 For these reasons, it is natural to
consider deploying many RL algorithms rather than a single one for these large distributed
problems. We will discuss the coordination issues such an approach raises in conjunction
with multi-agent systems in Section 3.1.3 and with learnability in COINs later in this
chapter.

^1 There exist some model-based variants of traditional RL. See for example [Q.

^2 One possible solution would be to run the RL off-line on a simulation of the full system and then



3.1.2 Distributed Artificial Intelligence 

The field of Distributed Artificial Intelligence (DAI) has arisen as more and more tra- 
ditional Artificial Intelligence (AI) tasks have migrated toward parallel implementation. 
The most direct approach to such implementations is to directly parallelize AI production
systems or the underlying programming languages [9^, 221]. An alternative and more
challenging approach is to use distributed computing, where not only are the individual
reasoning, planning and scheduling AI tasks parallelized, but there are different modules
with different such tasks, concurrently working toward a common goal [137, 138, 16£].

In a DAI, one needs to ensure that the task has been modularized in a way that im- 
proves efficiency. Unfortunately, this usually requires a central controller whose purpose 
is to allocate tasks and process the associated results. Moreover, designing that con- 
troller in a traditional AI fashion often results in brittle solutions. Accordingly, recently 
there has been a move toward both more autonomous modules and fewer restrictions on
the interactions among the modules [22£].



Despite this evolution, DAI maintains the traditional AI concern with a pre-fixed set
of particular aspects of intelligent behavior (e.g., reasoning, understanding, learning, etc.)
rather than with their cumulative character. As the idea that intelligence may have more
to do with the interaction among components started to take shape |^, focus shifted
to concepts (e.g., multi-agent systems) that better incorporated that idea [140].



3.1.3 Multi-Agent Systems 

The field of Multi-Agent Systems (MAS) is concerned with the interactions among the
members of such a set of agents [108, 140, 239, 260], as well as the inner workings
of each agent in such a set (e.g., their learning algorithms) [3^, 3S, ...]. As in
computational ecologies and computational markets (see below), a well-designed MAS is one
that achieves a global task through the actions of its components. The associated design
steps involve [140]:



1. Decomposing a global task into distributable subcomponents, yielding tractable 
tasks for each agent; 

2. Establishing communication channels that provide sufficient information to each
of the agents for it to achieve its task, but are not too unwieldy for the overall
system to sustain; and

3. Coordinating the agents in a way that ensures that they cooperate on the global
task, or at the very least does not allow them to pursue conflicting strategies in
trying to achieve their tasks.

^2 (continued) convey the results to the components of the system at the price of a single
centralized initialization (e.g., [195]). In general though, this approach will suffer from
being extremely dependent on "getting the details right" in the simulation.

Step (3) is rarely trivial; one of the main difficulties encountered in MAS design is
that agents act selfishly and artificial cooperation structures have to be imposed on their
behavior to enforce cooperation |jl2[. An active area of research, which holds promise
for addressing parts of the COIN design problem, is to determine how selfish agents'
"incentives" have to be engineered in order to avoid the tragedy of the commons (TOC)
[244]. (This work draws on the economics literature, which we review separately below.)
When simply providing the right incentives is not sufficient, one can resort to strategies
that actively induce agents to cooperate rather than act selfishly. In such cases
coordination p4[l[|, negotiations [156], coalition formation [228, 230, 296], or contracting
among agents may be needed to ensure that they do not work at cross purposes.

Unfortunately, all of these approaches share with DAI and its offshoots the problem 
of relying excessively on hand-tailoring, and therefore being difficult to scale and often 
nonrobust. In addition, except as noted in the next subsection, they involve no RL, and 
therefore the constituent computational elements are usually not as adaptive and robust 
as we would like. 



3.1.4 Reinforcement Learning-Based Multi-Agent Systems

Because it neither requires explicit modeling of the environment nor having a "teacher" 
that provides the "correct" actions, the approach of having the individual agents in a 
MAS use RL is well-suited for MAS's deployed in domains where one has little knowledge 
about the environment and/or other agents. There are two main approaches to designing 
such MAS's: 

(i) One has 'solipsistic agents' that don't know about each other and whose RL rewards 
are given by the performance of the entire system (so the joint actions of all other agents 
form an "inanimate background" contributing to the reward signal each agent receives); 

(ii) One has 'social agents' that explicitly model each other and take each other's actions
into account. 

Both (i) and (ii) can be viewed as ways to (try to) coordinate the agents in a MAS in a 
robust fashion. 

Solipsistic Agents: MAS's with solipsistic agents have been successfully applied to
a multitude of problems [60, 112, 122, 227, 233]. Generally, these schemes use RL
algorithms similar to those discussed in Section 3.1.1. However much of this work lacks
a well-defined global task or broad applicability (e.g., [227]). More generally, none of the
work with solipsistic agents scales well (as illustrated in our experiments on the "bar
problem", recounted below). The problem is that each agent must be able to discern
the effect of its actions on the overall performance of the system, since that performance
constitutes its reward signal. As the number of agents increases though, the effects of 
any one agent's actions (signal) will be swamped by the effects of other agents (noise), 
making the agent unable to learn well, if at all. (See the discussion below on learnability.) 
In addition, of course, solipsistic agents cannot be used in situations lacking centralized 
calculation and broadcast of the single global reward signal. 
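
The signal-versus-noise argument for solipsistic agents can be sketched numerically under a deliberately simple toy model, which is our assumption and not a system from this chapter: each of N agents independently picks an action of -1 or +1 with equal probability, and the shared reward is the sum of all actions. The correlation between one agent's own action and the shared reward is then exactly 1/sqrt(N), so the learnable signal shrinks as agents are added. The function names, sample count and seed are hypothetical.

```python
import math
import random


def analytic_correlation(num_agents: int) -> float:
    """Correlation between one agent's action and the summed reward in the toy model."""
    return 1.0 / math.sqrt(num_agents)


def simulated_correlation(num_agents: int, num_samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same correlation, using a seeded generator."""
    rng = random.Random(seed)
    own, shared = [], []
    for _ in range(num_samples):
        actions = [rng.choice((-1, 1)) for _ in range(num_agents)]
        own.append(actions[0])       # the agent whose learning problem we examine
        shared.append(sum(actions))  # global reward broadcast to every agent
    mean_own = sum(own) / num_samples
    mean_shared = sum(shared) / num_samples
    cov = sum((a - mean_own) * (g - mean_shared) for a, g in zip(own, shared)) / num_samples
    var_own = sum((a - mean_own) ** 2 for a in own) / num_samples
    var_shared = sum((g - mean_shared) ** 2 for g in shared) / num_samples
    return cov / math.sqrt(var_own * var_shared)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, analytic_correlation(n))
```

Real reward functions are rarely additive in this way, so the sketch indicates only the direction of the effect, not its size in any particular system.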

Social agents: MAS's whose agents take the actions of other agents into account
synthesize RL with game theoretic concepts (e.g., Nash equilibrium). They do this to try
to ensure that the overall system both moves toward achieving the overall global goal
and avoids often deleterious oscillatory behavior [5^, 9£, 130, 131, 132]. To that end, the
agents incorporate internal mechanisms that actively model the behavior of other agents. 
In Section 3.3.1, we discuss a situation where such modeling is necessarily self-defeating. 
More generally, this approach usually involves extensive hand-tailoring for the problem 
at hand. 



3.2 Social Science-Inspired Systems

Some human economies provide examples of naturally occurring systems that can be
viewed as (more or less) well-performing COINs. The field of economics provides much
more though. Both empirical economics (e.g., economic history, experimental economics)
and theoretical economics (e.g., general equilibrium theory |^], theory of optimal
taxation [189]) provide a rich literature on strategic situations where many parties interact.
In fact, much of the entire field of economics can be viewed as concerning how to max- 
imize certain constrained kinds of world utilities, when there are certain (very strong) 
restrictions on the individual agents and their interactions, and in particular when we 
have limited freedom in setting either the utility functions of those agents or modifying 
their RL algorithms in any other way. 

In this section we summarize just two economic concepts, both of which are very 
closely related to COINs, in that they deal with how a large number of interacting agents 
can function in a stable and efficient manner: general equilibrium theory and mechanism 
design. We then discuss general attempts to apply those concepts to distributed compu- 
tational problems. We follow this with a discussion of game theory, and then present a 
particular celebrated toy-world problem that involves many of these issues. 

3.2.1 General Equilibrium Theory 

Often the first version of "equilibrium" that one encounters in economics is that of 
supply and demand in single markets: the price of the market's good is determined by 
where the supply and demand curves for that good intersect. In cases where there is 
interaction among multiple markets however, even when there is no production but only 
trading, one cannot simply determine the price of each market's good individually, as 
both the supply and demand for each good depend on the supply and demand of other
goods. Considering the price fluctuations across markets leads to the concept of 'general
equilibrium', where prices for each good are determined in such a way to ensure that all
markets 'clear' [250]. Intuitively, this means that prices are set so the total supply of
each good is equal to the demand for that good.^3 The existence of such an equilibrium,
proven in Q, was first postulated by Leon Walras [271]. A mechanism that calculates the
equilibrium (i.e., 'market-clearing') prices now bears his name: the Walrasian auctioneer.
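
A small worked instance may help fix the idea of market-clearing prices. The sketch below assumes a two-good pure-exchange economy in which every agent has a Cobb-Douglas utility x^alpha * y^(1 - alpha); this functional form, the choice of good y as the unit of price, and all names are our assumptions for illustration, not part of the theory as reviewed here. Under those assumptions an agent with wealth w demands alpha * w / p units of x and (1 - alpha) * w units of y, and the clearing price of x has a closed form.

```python
def clearing_price(agents):
    """Price of good x (good y has price 1) at which both markets clear.

    agents -- list of (alpha, endowment_x, endowment_y) tuples, with 0 < alpha < 1.
    """
    numerator = sum(alpha * endow_y for alpha, _, endow_y in agents)
    denominator = sum((1.0 - alpha) * endow_x for alpha, endow_x, _ in agents)
    return numerator / denominator


def excess_demand(agents, price_x):
    """Total demand minus total supply for goods x and y at the given price of x."""
    excess_x = excess_y = 0.0
    for alpha, endow_x, endow_y in agents:
        wealth = price_x * endow_x + endow_y  # the agent's 'budget' at these prices
        excess_x += alpha * wealth / price_x - endow_x
        excess_y += (1.0 - alpha) * wealth - endow_y
    return excess_x, excess_y


if __name__ == "__main__":
    # Hypothetical economy: one agent owns all of x, the other owns all of y.
    economy = [(0.25, 2.0, 0.0), (0.75, 0.0, 3.0)]
    price = clearing_price(economy)
    print(price, excess_demand(economy, price))
```

Here the closed-form calculation plays the role of the centralized mechanism: it uses every agent's preferences and endowments at once, which is exactly the global information a decentralized system lacks.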

In general, for an arbitrary goal for the overall system, there is no reason to believe 
that having markets clear achieves that goal. In other words, there is no a priori reason 
why the general equilibrium point should maximize one's provided world utility function. 
However, consider the case where one's goal for the overall system is in fact that the 
markets clear. In such a context, examine the case where the interactions of real-world
agents will induce the overall system to adopt the general equilibrium point, so long as
certain broad conditions hold. Then if we can impose those conditions, we can cause the
overall system to behave in the manner we wish. However general equilibrium theory is
not sufficient to establish those "broad conditions", since it says little about real-world
agents. In particular, general equilibrium theory suffers from having no temporal aspect 
(i.e., no dynamics) and from assuming that all the agents are perfectly rational. 

Another shortcoming of general equilibrium theory as a model of real-world systems 
is that, despite its concern with prices, it does not readily accommodate the full concept of
money [^]. Of the three main roles money plays in an economy (medium of exchange
in trades, store of value for future trades, and unit of account), none is essential in a
general equilibrium setting. The unit of account aspect is not needed as the bookkeeping
is performed by the Walrasian auctioneer. Since the supplies and demands are matched
directly there is no need to facilitate trades, and thus no role for money as a medium
of exchange. And finally, as the system reaches an equilibrium in one step, through the
auctioneer, there is no need to store value for future trading rounds [209].

The reason that money is not needed can be traced to the fact that there is an 
"overseer" with global information who guides the system. If we remove the centralized 
communication and control exerted by this overseer, then (as in a real economy) agents
will no longer know the exact details of the overall economy. They will be forced to
make guesses as in any learning system, and the differences in those guesses will lead
to differences in their actions [160, 161].

^More formally, each agent's utility is a function of that agent's allotment of all the possible goods.
In addition, every good has a price. (Utility functions are independent of money.) Therefore, for any set
of prices for the goods, every agent has a 'budget', given by their initial allotment of goods. We pool
all the agents' goods together. In the 'tatonnement' (single step) version of market clearing, we next
allocate the goods back among the agents in such a way that each agent is given a total value of goods
(as determined by the prices) equal to that agent's budget (as determined by the prices and by that
agent's initial allotment). As a (formally identical) alternative, we can have a two-step process, in which
first each agent is given funds equal to its budget, and then each agent decides how to use those funds
to purchase goods from the central pool. In either case, the 'market clearing' prices are those prices for
which exactly all of the goods in the pool are reallocated back among the agents (no "excess supply"),
and for which each agent views its allocation of goods as optimizing its utility, subject to its budget and
to those prices for the goods (no "excess demand"). Similar definitions hold for a 'production' rather
than 'endowment' economy.



Such a decentralized learning-based system more closely resembles a COIN than does 
a conventional general equilibrium system. In contrast to general equilibrium systems, 
the three main roles money plays in a human economy are crucial to the dynamics of 
such a decentralized system [^]. This comports with the important effects in COINs of 
having the agents' utility functions involve money (see Background section above). 

3.2.2 Mechanism Design 

Even if there exists centralized communication so that we aren't considering a full-blown
COIN, if there is no centralized Walras-like control, it is usually highly non-trivial to
induce the overall system to adopt the general equilibrium point. One way to try
to do so is via an auction. (This is the approach usually employed in computational
markets; see below.) Along with optimal taxation and public good theory [157],
the design of auctions is the subject of the field of mechanism design. More generally,
mechanism design is concerned with the incentives that must be applied to any set
of agents that interact and exchange goods [18g, 267] in order to get those agents to
exhibit desired behavior. Usually that desired behavior concerns pre-specified utility
functions of some sort for each of the individual agents. In particular, mechanism design
is usually concerned with incentive schemes which induce '(Pareto) efficient' (or 'Pareto
optimal') allocations in which no agent can be made better off without hurting another
agent [10g, lO].



One particularly important type of such an incentive scheme is an auction. When 
many agents interact in a common environment often there needs to be a structure that 
supports the exchange of goods or information among those agents. Auctions provide 
one such (centralized) structure for managing exchanges of goods. For example, in the 
English auction all the agents come together and 'bid' for a good, and the price of the 
good is increased until only one bidder remains, who gets the good in exchange for the 
resource bid. As another example, in the Dutch auction the price of a good is decreased 
until one buyer is willing to pay the current price. 
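
The two auction formats just described can be sketched as follows. This is a minimal illustration under stated assumptions: bidders are non-strategic (each stays in, or accepts, whenever the price does not exceed its private valuation), prices move in fixed increments, and the function names are hypothetical.

```python
def english_auction(valuations, start=0.0, increment=1.0):
    """Ascending auction: raise the price while at least two bidders remain."""
    price = start
    active = [i for i, v in enumerate(valuations) if v >= price]
    while len(active) > 1:
        price += increment
        still_in = [i for i in active if valuations[i] >= price]
        if not still_in:
            # All remaining bidders dropped out together: award at the last price.
            return active[0], price - increment
        active = still_in
    return (active[0] if active else None), price


def dutch_auction(valuations, start=100.0, decrement=1.0):
    """Descending auction: lower the price until some bidder accepts it."""
    price = start
    while price > 0:
        takers = [i for i, v in enumerate(valuations) if v >= price]
        if takers:
            return takers[0], price
        price -= decrement
    return None, 0.0


valuations = [30.0, 55.0, 42.0]
english_result = english_auction(valuations)  # (winner index, price paid)
dutch_result = dutch_auction(valuations)
```

With these non-strategic bidders the same agent wins both auctions, but in the English auction the price stops just above the second-highest valuation, so the winner pays less than it would have been willing to pay.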

All auctions perform the same task: match supply and demand. As such, auctions are 
one of the ways in which price equilibration among a set of interacting agents (perhaps an 
equilibration approximating general equilibrium, perhaps not) can be achieved. However, 
an auction mechanism that induces Pareto efficiency does not necessarily maximize some 
other world utility. For example, in a transaction in an English auction both the seller 
and the buyer benefit. They may even have arrived at an allocation which is efficient. 
However, in that the winner may well have been willing to pay more for the good, such 
an outcome may confound the goal of the market designer, if that designer's goal is to 
maximize revenue. This point is returned to below, in the context of computational
economics.



3.2.3 Computational Economics 

'Computational economies' are schemes inspired by economics, and more specifically by 
general equilibrium theory and mechanism design theory, for managing the components 
of a distributed computational system. They work by having a 'computational market', 
akin to an auction, guide the interactions among those components. Such a market is 
defined as any structure that allows the components of the system to exchange infor- 
mation on relative valuation of resources (as in an auction), establish equilibrium states 
(e.g., determine market clearing prices) and exchange resources (i.e., engage in trades).

Such computational economies can be used to investigate real economies and biological 
systems [3^, ^, 149]. They can also be used to design distributed computational
systems. For example, such computational economies are well-suited to some distributed 
resource allocation problems, where each component of the system can either directly 
produce the "goods" it needs or acquire them through trades with other components. 
Computational markets often allow for far more heterogeneity in the components than 
do conventional resource allocation schemes. Furthermore, there is both theoretical and 
empirical evidence suggesting that such markets are often able to settle to equilibrium 
states. For example, auctions find prices that satisfy both the seller and the buyer which 
results in an increase in the utility of both (else one or the other would not have agreed 
to the sale). Assuming that all parties are free to pursue trading opportunities, such 
mechanisms move the system to a point where all possible bilateral trades that could 
improve the utility of both parties are exhausted. 

Now restrict attention to the case, implicit in much of computational market work, 
with the following characteristics: First, world utility can be expressed as a monotonically
increasing function F where each argument i of F can in turn be interpreted as the value
of a pre-specified utility function $f_i$ for agent i. Second, each of those $f_i$ is a function
of an i-indexed 'goods vector' $x_i$ of the non-perishable goods "owned" by agent i. The
components of that vector are $x_{i,j}$, and the overall system dynamics is restricted to
conserve the vector $\sum_i x_{i,j}$. (There are also some other, more technical conditions.) As
an example, the resource allocation problem can be viewed as concerning such vectors 
of "owned" goods. 

Due to the second of our two conditions, one can integrate a market-clearing mecha- 
nism into any system of this sort. Due to the first condition, since in a market equilibrium 
with non-perishable goods no (rational) agent ends up with a value of its utility function 
lower than the one it started with, the value of the world utility function must be higher 
at equilibrium than it was initially. In fact, so long as the individual agents are smart 
enough to avoid all trades in which they do not benefit, any computational market can 
only improve this kind of world utility, even if it does not achieve the market equilibrium. 
(See the discussion of "weak triviality" below.)
This line of reasoning provides one of the main reasons to use computational markets
when they can be applied. Conversely, it underscores one of the major limitations of 
such markets: Starting with an arbitrary world utility function with arbitrary dynamical 
restrictions, it may be quite difficult to cast that function as a monotonically increasing 
F taking as arguments a set of agents' goods-vector-based utilities $f_i$, if we require that
those $f_i$ be well-enough behaved that we can reasonably expect the agents to optimize
them in a market setting. 
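
The monotonicity argument above can be illustrated with a small simulation. The sketch below makes several assumptions that are not in the text: each agent's utility is a weighted sum of square roots of its holdings, the world utility F is the plain sum of the agents' utilities, and trades are random bilateral swaps that are accepted only if neither party ends up worse off. All names are hypothetical.

```python
import math
import random


def utility(weights, goods):
    """Hypothetical concave utility f_i: weighted sum of square roots."""
    return sum(w * math.sqrt(x) for w, x in zip(weights, goods))


def world_utility(all_weights, all_goods):
    """One monotonically increasing F: the plain sum of the f_i."""
    return sum(utility(w, g) for w, g in zip(all_weights, all_goods))


def run_market(all_weights, all_goods, steps=2000, seed=0):
    """Propose random swaps; return the world utility after every proposal."""
    rng = random.Random(seed)
    n, m = len(all_goods), len(all_goods[0])
    history = [world_utility(all_weights, all_goods)]
    for _ in range(steps):
        a, b = rng.sample(range(n), 2)   # two distinct agents
        j, k = rng.sample(range(m), 2)   # a gives up good j, b gives up good k
        qj = rng.uniform(0, all_goods[a][j])
        qk = rng.uniform(0, all_goods[b][k])
        new_a, new_b = list(all_goods[a]), list(all_goods[b])
        new_a[j] -= qj
        new_b[j] += qj
        new_a[k] += qk
        new_b[k] -= qk
        # A trade goes through only if neither party ends up worse off.
        a_ok = utility(all_weights[a], new_a) >= utility(all_weights[a], all_goods[a])
        b_ok = utility(all_weights[b], new_b) >= utility(all_weights[b], all_goods[b])
        if a_ok and b_ok:
            all_goods[a], all_goods[b] = new_a, new_b
        history.append(world_utility(all_weights, all_goods))
    return history


weights = [[1.0, 0.2], [0.2, 1.0], [0.5, 0.5]]
goods = [[1.0, 9.0], [9.0, 1.0], [5.0, 5.0]]
history = run_market(weights, goods)
```

Every swap conserves the total amount of each good, and the recorded world utility never decreases, whether or not the process ever reaches a market equilibrium.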

One example of a computational economy being used for resource allocation is Hu- 
berman and Clearwater's use of a double-blind auction to solve the complex task of 
controlling the temperature of a building. In this case, each agent (individual temper- 
ature controller) bids to buy or sell cool or warm air. This market mechanism leads to 
an equitable temperature distribution in the system [135]. Other domains where market
mechanisms were successfully applied include purchasing memory in an operating
system [p^], allocating virtual circuits [pOl], "stealing" unused CPU cycles in a network of
computers [W^, 26£], predicting option futures in financial markets [214], and numerous
scheduling and distributed resource allocation problems [159, 165, 245, 255, 275, 276].

Computational economics can also be used for tasks not tightly coupled to resource 
allocation. For example, following the work of Maes [174] and Ferber, Baum shows
how, by using computational markets, a large number of agents can interact and cooperate
to solve a variant of the blocks world problem [2^, 24].

Viewed as candidate COINs, all market-based computational economies fall short in
relying on both centralized communication and centralized control to some degree. Often 
that reliance is extreme. For example, the systems investigated by Baum not only have 
the centralized control of a market, but in addition have centralized control of all other 
non-market aspects of the system. (Indeed, the market is secondary, in that it is only 
used to decide which single expert among a set of candidate experts gets to exert that 
centralized control at any given moment). Doubt has also been cast on how well
computational economies perform in practice [264], and they often require extensive
hand-tailoring.

Finally, return to consideration of a world utility function that is a monotonically
increasing function F whose arguments are the utilities of the agents. In general, the
maximum of such a world utility function will be a Pareto optimal point. So given the
utility functions of the agents, by considering all such F we map out an infinite set S
of Pareto optimal points that maximize some such world utility function. (S is usually
infinite even if we only consider maximizing those world utilities subject to an overall
conservation of goods constraint.) Now the market equilibrium is a Pareto optimal point,
and therefore lies in S. But it is only one element of S. Moreover, it is usually set in
full by the utilities of the agents, in concert with the agents' initial endowments. In
particular, it is independent of the world utility. In general then, given the utilities of
the agents and a world utility F, there is no a priori reason to believe that the particular
element in S picked out by the auction is the point that maximizes that particular world
utility. This subtlety is rarely addressed in the work on using computational markets
to achieve a global goal. It need not be insurmountable, however. For example, one
obvious idea would be to try to distort the agents' perceptions of their utility functions
and/or initial endowments so that the resultant market equilibrium has a higher value
of the world utility at hand.^



3.2.4 Perfect Rationality Noncooperative Game Theory 

Game theory is the branch of mathematics concerned with formalized versions of "games"
in the sense of chess, poker, nuclear arms races, and the like [31, 101, 87, 242, 75, 170,
20, 10]. It is perhaps easiest to describe it by loosely defining some of its terminology,
which we do here and in the next subsection.

The simplest form of a game is that of a 'non-cooperative single-stage extensive-form'
game, which involves the following situation: There are two or more agents (called 
'players' in the literature), each of which has a pre-specified set of possible actions that 
it can follow. (A 'finite' game has finite sets of possible actions for all the players.) 
In addition, each agent i has a utility function (also called a 'payoff matrix' for finite 
games). This maps any 'profile' of the action choices of all agents to an associated utility 
value for agent i. (In a 'zero-sum' game, for every profile, the sum of the payoffs to all 
the agents is zero.) 

The agents choose their actions in a sequence, one after the other. The structure 
determining what each agent knows concerning the action choices of the preceding agents 
is known as the 'information set'.^ Games in which each agent knows exactly what the
preceding ('leader') agent did are known as 'Stackelberg games'. (A variant of such a
game is considered in our experiments below. See also [153].)

In a 'multi-stage' game, after all the agents choose their first action, each agent is
provided some information concerning what the other agents did. The agent uses this
information to choose its next action. In the usual formulation, each agent gets its payoff
at the end of all of the game's stages.

^In fact, the second theorem of welfare economics [250] states that given any world utility such that:
i) its global maximum can be written as a Pareto optimal point for agents' utilities all of whose level sets
are convex; ii) no other maximum of that world utility is Pareto optimal for those agents' utilities; then
one can always set initial endowments of the agents so that that Pareto optimal point corresponds to
the price-clearing point for those endowments. Note though that that setting of endowments requires a
centralized process. Moreover, even if we are allowed such a process to set the endowments, we still may
not be able to successfully exploit this theorem to arrive at the world utility maximum, if we use markets
involving iterative trading with dynamic associated prices, like those in the real world. This is because
an intermediate trade with "incorrect" prices may have resulted in some particular agent's having its
utility rise beyond the level it has at the point maximizing world utility, and since that agent will never
afterward engage in a trade that diminishes its utility, the system will never arrive at the world utility
maximum.

^While stochastic choice of actions is central to game theory, most of the work in the field assumes
the information in information sets is in the form of definite facts, rather than a probability distribution.
Accordingly, there has been relatively little work incorporating Shannon information theory into the
analysis of information sets.

An agent's 'strategy' is the rule it elects to follow mapping the information it has at 
each stage of a game to its associated action. It is a 'pure strategy' if it is a deterministic 
rule. If instead the agent's action is chosen by randomly sampling from a distribution, 
that distribution is known as a 'mixed strategy'. Note that an agent's strategy concerns
all possible sequences of provided information, even any that cannot arise due to the 
strategies of the other agents. 

Any multi-stage extensive-form game can be converted into a 'normal form' game, 
which is a single-stage game in which each agent is ignorant of the actions of the other 
agents, so that all agents choose their actions "simultaneously". This conversion is 
achieved by having each "action" of an agent in the normal form game correspond to
an entire strategy in the associated multi-stage extensive-form game. The payoffs to all
the agents in the normal form game for a particular strategy profile are then given by the
associated payoff matrices of the multi-stage extensive-form game.

A 'solution' to a game, or an 'equilibrium', is a profile in which every agent behaves 
"rationally". This means that every agent's choice of strategy optimizes its utility sub- 
ject to a pre-specified set of conditions. In conventional game theory those conditions 
involve, at a minimum, perfect knowledge of the payoff matrices of all other players, and 
often also involve specification of what strategies the other agents adopted and the like. 
In particular, a 'Nash equilibrium' is a profile where each agent has chosen the best
strategy it can, given the choices of the other agents. A game may have no Nash equilibria,
one equilibrium, or many equilibria in the space of pure strategies. A beautiful and
seminal theorem due to Nash proves that every finite game has at least one Nash equilibrium
in the space of mixed strategies [19£].
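
The definition of a pure-strategy Nash equilibrium can be checked by brute force in a two-player finite game. The sketch below is illustrative only: the function name and the three example payoff matrices (standard textbook games, not taken from this survey) are assumptions.

```python
from itertools import product


def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all (row, col) profiles from which neither player gains by deviating alone.

    payoff_a[r][c] and payoff_b[r][c] are the payoffs to the row and column
    player when the row player picks r and the column player picks c.
    """
    rows, cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for r, c in product(range(rows), range(cols)):
        row_best = all(payoff_a[r][c] >= payoff_a[r2][c] for r2 in range(rows))
        col_best = all(payoff_b[r][c] >= payoff_b[r][c2] for c2 in range(cols))
        if row_best and col_best:
            equilibria.append((r, c))
    return equilibria


# Prisoner's Dilemma (action 0 = cooperate, 1 = defect): one pure equilibrium.
pd_a = [[3, 0], [5, 1]]
pd_b = [[3, 5], [0, 1]]
# Matching pennies (zero-sum): no pure equilibrium, only a mixed one.
mp_a = [[1, -1], [-1, 1]]
mp_b = [[-1, 1], [1, -1]]
# Coordination game: two pure equilibria.
co_a = [[2, 0], [0, 1]]
co_b = [[2, 0], [0, 1]]
```

The three games give one, zero, and two pure-strategy equilibria respectively. In the Prisoner's Dilemma the unique equilibrium is mutual defection, even though both players would do better under mutual cooperation.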

There are several different reasons one might expect a game to result in a Nash equi- 
librium. One is that it is the point that perfectly rational Bayesian agents would adopt, 
assuming the probability distributions they used to calculate expected payoffs were consistent
with one another [^, 143]. A related reason, arising even in a non-Bayesian
setting, is that a Nash equilibrium provides "consistent" predictions, in that
if all parties predict that the game will converge to a Nash equilibrium, no one will
benefit by changing strategies. Having a consistent prediction does not ensure that all
agents' payoffs are maximized though. The study of small perturbations around Nash
equilibria from a stochastic dynamics perspective is just one example of a 'refinement' of
Nash equilibrium, that is, a criterion for selecting a single equilibrium state when more
than one is present [176].

In cooperative game theory the agents are able to enter binding contracts with one 
another, and thereby coordinate their strategies. This allows the agents to avoid being
"stuck" in Nash equilibria that are Pareto inefficient, that is, being stuck at equilibrium
profiles in which all agents would benefit if only they could agree to all adopt different
strategies, with no possibility of betrayal. The characteristic function of a game involves
subsets ('coalitions') of agents playing the game. For each such subset, it gives the 
sum of the payoffs of the agents in that subset that those agents can guarantee if they 
coordinate their strategies. An imputation is a division of such a guaranteed sum among 
the members of the coalition. It is often the case that for a subset of the agents in a 
coalition one imputation dominates another, meaning that under threat of leaving the 
coalition that subset of agents can demand the first imputation rather than the second. 
So the problem each agent i is confronted with in a cooperative game is which set of other 
agents to form a coalition with, given the characteristic function of the game and the 
associated imputations i can demand of its partners. There are several different kinds 
of solution for cooperative games that have received detailed study, varying in how the 
agents address this problem of who to form a coalition with. Some of the more popular 
are the 'core', the 'Shapley value', the 'stable set solution', and the 'nucleolus'. 
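
As an illustration of one of these solution concepts, the Shapley value can be computed directly from a characteristic function by averaging each agent's marginal contribution over all orders in which the agents could join the coalition. The sketch below is a minimal example; the three-agent "glove game" characteristic function and the function names are assumptions chosen for illustration.

```python
from itertools import permutations


def shapley_values(players, v):
    """Average each player's marginal contribution over all join orders.

    `v` is the characteristic function: it maps a frozenset of players to
    the total payoff that coalition can guarantee.
    """
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += v(with_p) - v(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


def glove_game(coalition):
    """Agent 1 holds a left glove, agents 2 and 3 a right glove each.
    A coalition is worth 1 if it can assemble a pair, else 0."""
    has_left = 1 in coalition
    has_right = 2 in coalition or 3 in coalition
    return 1.0 if has_left and has_right else 0.0


values = shapley_values([1, 2, 3], glove_game)
```

Here the holder of the scarce left glove receives 2/3 of the total, and the two interchangeable agents receive 1/6 each. Enumerating all orderings is only feasible for small numbers of agents.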

In the real world, the actual underlying game the agents are playing does not only 
involve the actions considered in cooperative game theory's analysis of coalitions and 
imputations. The strategies of that underlying game also involve bargaining behavior, 
considerations of trying to cheat on a given contract, bluffing and threats, and the like. 
In many respects, by concentrating on solutions for coalition formation and their relation 
with the characteristic function, cooperative game theory abstracts away these details 
of the true underlying game. Conversely though, progress has recently been made in 
understanding how cooperative games can arise from non-cooperative games, as they 
must in the real world [10].



3.3 Evolution and Learning in Games 

Not surprisingly, game theory has come to play a large role in the field of multi-agent 
systems. In addition, due to Darwinian natural selection, one might expect game theory 
to be quite important in population biology, in which the "utility functions" of the 
individual agents can be taken to be their reproductive fitness. As it turns out, there is 
an entire subfield of game theory concerned with this connection with population biology, 
called 'evolutionary game theory' [178, 181].

To introduce evolutionary game theory, consider a game in which all players share the 
same space of possible strategies, and there is an additional space of possible 'attribute 
vectors' that characterize an agent, along with a probability distribution g across that 
new space. (Examples of attributes in the physical world could be things like size, 
speed, etc.) We select a set of agents to play a game by randomly sampling g. Those 
agents' attribute vectors jointly determine the payoff matrices of each of the individual 
agents. (Intuitively, what benefit accrues to an agent for taking a particular action 
depends on its attributes and those of the other agents.) However, each agent i has
limited information concerning both its attribute vector and that of the other players
in the game, information encapsulated in an 'information structure'. The information
structure specifies how much each agent knows concerning the game it is playing.

In this context, we enlarge the meaning of the term "strategy" to not just be a 
mapping from information sets and the like to actions, but from entire information 
structures to actions. In addition to the distribution g over attribute vectors, we also 
have a distribution over strategies, h. A strategy s is a 'population strategy' if h is a
delta function about s. Intuitively, we have a population strategy when each animal in a
population "follows the same behavioral rules", rules that take as input what the animal
is able to discern about its strengths and weaknesses relative to those other members of
the population, and produce as output how the animal will act in the presence of such 
animals. 

Given g, a population strategy centered about s, and its own attribute vector, any 
player i in the support of g has an expected payoff for any strategy it might adopt. When 
i's payoff could not improve if it were to adopt any strategy other than s, we say that s 
is 'evolutionarily stable'. Intuitively, an evolutionarily stable strategy is one that is stable
with respect to the introduction of mutants into the population. 

Now consider a sequence of such evolutionary games. Interpret the payoff that any 
agent receives after being involved in such a game as the 'reproductive fitness' of that 
agent, in the biological sense. So the higher the payoff the agent receives, in comparison 
to the fitnesses of the other agents, the more "offspring" it has that get propagated 
to the next game. In the continuum-time limit, where games are indexed by the real 
number t, this can be formalized by a differential equation. This equation specifies the 
derivative of $g_t$ evaluated for each agent i's attribute vector, as a monotonically increasing
function of the relative difference between the payoff of i and the average payoff of all
the agents. (We also have such an equation for h.) The resulting dynamics is known as
'replicator dynamics', with an evolutionarily stable population strategy, if it exists, being
one particular fixed point of the dynamics. 
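
A minimal numerical sketch of replicator dynamics follows. It uses the standard textbook form over a finite set of strategies, dx_i/dt = x_i (f_i - f_avg), rather than the more general formulation over attribute vectors described above; the Hawk-Dove payoff matrix, the Euler integration scheme, and the function names are all illustrative assumptions.

```python
def replicator_step(x, payoff, dt):
    """One Euler step of dx_i/dt = x_i * (f_i - f_avg), where f = payoff * x."""
    n = len(x)
    fitness = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    avg = sum(x[i] * fitness[i] for i in range(n))
    new_x = [x[i] + dt * x[i] * (fitness[i] - avg) for i in range(n)]
    total = sum(new_x)  # renormalise to absorb floating-point drift
    return [value / total for value in new_x]


def simulate(x0, payoff, dt=0.01, steps=5000):
    """Integrate the population shares forward from the initial mix x0."""
    x = list(x0)
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x


# Hawk-Dove with resource value 2 and fight cost 4.
# Rows and columns are ordered (hawk, dove); entries are payoffs to the row strategy.
hawk_dove = [[-1.0, 2.0],
             [0.0, 1.0]]
final = simulate([0.1, 0.9], hawk_dove)
```

Starting from a population that is 10% hawks, the shares converge to an even mix of hawks and doves, which is the evolutionarily stable state of this particular game and a fixed point of the dynamics.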

Now consider removing the reproductive aspect of evolutionary game theory, and 
instead have each agent propagate to the next game, with "memory" of the events of 
the preceding game. Furthermore, allow each agent to modify its strategy from one 
game to the next by "learning" from its memory of past games, in a bounded rational 
manner. The field of learning in games is concerned with exactly such situations [100, 11 17,
27, 80, 146, 205, 202]. Most of the formal work in this field involves simple models
for the learning process of the agents. For example, in 'fictitious play' [100], in each
successive game, each agent i adopts what would be its best strategy if its opponents
chose their strategies according to the empirical frequency distribution of such strategies
that i has encountered in the past. More sophisticated versions of this work employ
simple Bayesian learning algorithms, or reinventions of some of the techniques of the
RL community [222]. Typically in learning in games one defines a payoff to the agent
for a sequence of games, for example as a discounted sum of the payoffs in each of
the constituent games. Within this framework one can study the long term effects of
strategies such as cooperation and see if they arise naturally and if so, under what
circumstances.
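
Fictitious play, as described above, can be sketched in a few lines for a two-player repeated game. The choice of matching pennies, the initial count of one for each action, the tie-breaking rule, and the function names are assumptions made for illustration.

```python
def best_response(payoff, opp_counts):
    """Best pure action against the opponent's empirical action counts.

    payoff[a][b] is this player's payoff for own action a and opponent action b.
    """
    expected = [sum(payoff[a][b] * opp_counts[b] for b in range(len(opp_counts)))
                for a in range(len(payoff))]
    # Ties are broken in favour of the lowest-numbered action.
    return max(range(len(expected)), key=expected.__getitem__)


def fictitious_play(payoff_row, payoff_col, rounds=20000):
    """Return the empirical action frequencies of the row and column players."""
    # Start every count at 1 so that the first best response is well defined.
    row_counts = [1] * len(payoff_row)
    col_counts = [1] * len(payoff_col)
    for _ in range(rounds):
        r = best_response(payoff_row, col_counts)
        c = best_response(payoff_col, row_counts)
        row_counts[r] += 1
        col_counts[c] += 1
    row_total, col_total = sum(row_counts), sum(col_counts)
    return ([n / row_total for n in row_counts],
            [n / col_total for n in col_counts])


# Matching pennies: the row player wants to match, the column player to mismatch.
# Both matrices are indexed [own action][opponent's action].
mp_row = [[1, -1], [-1, 1]]
mp_col = [[-1, 1], [1, -1]]
row_freq, col_freq = fictitious_play(mp_row, mp_col)
```

Although each agent plays a deterministic best response in every round, the empirical frequencies of play drift toward the mixed-strategy equilibrium of matching pennies, in which each action is used half the time.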

Many aspects of real world games that do not occur very naturally otherwise arise 
spontaneously in these kinds of games. For example, when the number of games to be 
played is not pre-fixed, it may behoove a particular agent i to treat its opponent better 
than it would otherwise, since i may have to rely on that other agent's treating it well 
in the future, if they end up playing each other again. This framework also allows us to
investigate the dependence of evolving strategies on the amount of information available
to the agents [184]; the effect of communication on the evolution of cooperation [185, 187];
and the parallels between auctions and economic theory [123, 186].

In many respects, learning in games is even more relevant to the study of COINs than 
is traditional game theory. However, it suffers from the same major shortcoming: it is
almost exclusively focused on the forward problem rather than the inverse problem. In 
essence, COIN design is the problem of inverse game theory. 

3.3.1 El Farol Bar Problem

The "El Farol" bar problem and its variants provide a clean and simple testbed for 
investigating certain kinds of interactions among agents [0, pO, 241]. In the original
version of the problem, which arose in economics, at each time step (each "night"),
each agent needs to decide whether to attend a particular bar. The goal of the agent in 
making this decision depends on the total attendance at the bar on that night. If the total 
attendance is below a preset capacity then the agent should have attended. Conversely, 
if the bar is overcrowded on the given night, then the agent should not attend. (Because 
of this structure, the bar problem with capacity set to 50% of the total number of agents 
is also known as the 'minority game'; each agent selects one of two groups at each time 
step, and those that are in the minority have made the right choice). The agents make 
their choices by predicting ahead of time whether the attendance on the current night 
will exceed the capacity and then taking the appropriate course of action. 

What makes this problem particularly interesting is that it is impossible for each 
agent to be perfectly "rational", in the sense of correctly predicting the attendance on 
any given night. This is because if most agents predict that the attendance will be low 
(and therefore decide to attend), the attendance will actually be high, while if they predict
the attendance will be high (and therefore decide not to attend) the attendance will be
low. (In the language of game theory, this essentially amounts to the property that there
are no pure strategy Nash equilibria [52, 292].) Alternatively, viewing the overall system
as a COIN, it has a Prisoner's Dilemma-like nature, in that "rational" behavior by all 
the individual agents thwarts the global goal of maximizing total enjoyment (defined as 
the sum of all agents' enjoyment and maximized when the bar is exactly at capacity). 
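
The self-defeating nature of shared predictions can be seen in a small simulation. Both decision rules below are illustrative assumptions, not the predictors used in the original formulation of the problem: in the first, every agent predicts that tonight's attendance will equal last night's; in the second, every agent independently attends with probability equal to capacity divided by the number of agents.

```python
import random


def simulate_shared_predictor(n_agents=100, capacity=60, nights=10, first_attendance=0):
    """All agents predict 'same as last night' and attend only if that is below capacity."""
    history = [first_attendance]
    for _ in range(nights):
        predicted = history[-1]
        attendance = n_agents if predicted < capacity else 0
        history.append(attendance)
    return history


def simulate_mixed_strategy(n_agents=100, capacity=60, nights=1000, seed=0):
    """Each agent attends independently with probability capacity / n_agents."""
    rng = random.Random(seed)
    p = capacity / n_agents
    return [sum(rng.random() < p for _ in range(n_agents)) for _ in range(nights)]


shared = simulate_shared_predictor()
mixed = simulate_mixed_strategy()
mean_mixed = sum(mixed) / len(mixed)
```

With the shared deterministic predictor the bar alternates between being empty and completely full, so every agent's prediction is wrong every night. With independent randomization the attendance fluctuates around the capacity instead, although on any given night it is rarely exactly at capacity.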

This frustration effect is similar to what occurs in spin glasses in physics, and makes 
the bar problem closely related to the physics of emergent behavior in distributed systems
[50, 51, 295]. Researchers have also studied the dynamics of the bar problem
to investigate economic properties like competition, cooperation and collective behavior
and especially their relationship to market efficiency [141, 232].



3.4 Biologically Inspired Systems 

Properly speaking, biological systems do not involve utility functions and searches across 
them with RL algorithms. However, it has long been appreciated that there are many
ways in which viewing biological systems as involving searches over such functions can
lead to deeper understanding of them [p38, 290]. Conversely, some have argued that the
mechanism underlying biological systems can be used to help design search algorithms 

iii-a 

These kinds of reasoning which relate utility functions and biological systems have 
traditionally focussed on the case of a single biological system operating in some external 
environment. If we extend this kind of reasoning to a set of biological systems that
are co-evolving with one another, then we have essentially arrived at biologically-based
COINs. This section discusses how some of the previous work in the literature bears on this
relationship between COINs and biology.



3.4.1 Population Biology and Ecological Modeling 

The fields of population biology and ecological modeling are concerned with the large-scale
"emergent" processes that govern the systems that consist of many (relatively)
simple entities interacting with one another [25, 116]. As usually cast, the "simple entities"
are members of one or more species, and the interactions are some mathematical
abstraction of the process of natural selection as it occurs in biological systems (involving
processes like genetic reproduction of various sorts, genotype-phenotype mappings,
inter- and intra-species competitions for resources, etc.). Population biology and ecological
modeling in this context address questions concerning the dynamics of the resultant
ecosystem, and in particular how its long-term behavior depends on the details of the
interactions between the constituent entities. Broadly construed, the paradigm of ecological
modeling can even be extended to study how natural selection and self-regulating
feedback create a stable planet-wide ecological environment: Gaia [167].

The underlying mathematical models of other fields can often be usefully modified 
to apply to the kinds of systems population biology is interested in iQ. (See also the 
discussion in the game theory subsection above.) Conversely, the underlying mathe- 
matical models of population biology and ecological modeling can be applied to other 
non-biological systems. In particular, those models shed light on social issues such as
the emergence of language or culture, warfare, and economic competition [81, 82, 102].^
They also can be used to investigate more abstract issues concerning the behavior of
large complex systems with many interacting components [103, 114, 179, P03, pi!].

^See [L73, 281] though for some counter-arguments to the particular claims most commonly made in
this regard.

Going a bit further afield, an approach that is related in spirit to ecological modeling is 
'computational ecologies'. These are large distributed systems where each component of 
the system's acting (seemingly) independently results in complex global behavior. Those 
components are viewed as constituting an "ecology" in an abstract sense (although much 
of the mathematics is not derived from the traditional field of ecological modeling). 
In particular, one can investigate how the dynamics of the ecology is influenced by
the information available to each component and how cooperation and communication
among the components affect that dynamics [134, 13(3].

Although in some ways the most closely related to COINs of the current ecology- 
inspired research, the field of computational ecologies has some significant shortcomings 
if one tries to view it as a full science of COINs. In particular, it suffers from not 
being designed to solve the inverse problem of how to configure the system so as to 
arrive at a particular desired dynamics. This is a difficulty endemic to the general 
program of equating ecological modeling and population biology with the science of 
COINs. These fields are primarily concerned with the "forward problem" of determining 
the dynamics that arises from certain choices of the underlying system. Unless one's 
desired dynamics is sufficiently close to some dynamics that was previously catalogued 
(during one's investigation of the forward problem), one has very little information on 
how to set up the components and their interactions to achieve that desired dynamics. In
addition, most of the work in these fields does not involve RL algorithms and, viewed as a
context in which to design COINs, suffers from a need for hand-tailoring and potentially
from a lack of robustness and scalability.



3.4.2 Swarm Intelligence 

The field of 'swarm intelligence' is concerned with systems that are modeled after so- 
cial insect colonies, so that the different components of the system are queen, worker, 
soldier, etc. It can be viewed as ecological modeling in which the individual entities 
have extremely limited computing capacity and/or action sets, and in which there are 
very few types of entities. The premise of the field is that the rich behavior of social 
insect colonies arises not from the sophistication of any individual entity in the colony, 
but from the interaction among those entities. The objective of current research is to 
uncover kinds of interactions among the entity types that lead to pre-specified behavior 
of some sort. 

More speculatively, the study of social insect colonies may also provide insight into 
how to achieve learning in large distributed systems. This is because at the level of the 
individual insect in a colony, very little (or no) learning takes place. However, across
evolutionary time-scales the social insect species as a whole functions as if the various
individual types in a colony had "learned" their specific functions. The "learning" is the
direct result of natural selection. (See the discussion on this topic in the subsection on
ecological modeling.)

Swarm intelligences have been used to adaptively allocate tasks in a mail company [IIJ],
solve the traveling salesman problem [6^, 7^], and route data efficiently in dynamic
networks [3^, 236, 257], among others. Despite this, such intelligences do not
really constitute a general approach to designing COINs. There is no general framework 
for adapting swarm intelligences to maximize particular world utility functions. Accord- 
ingly, such intelligences generally need to be hand-tailored for each application. And 
after such tailoring, it is often quite a stretch to view the system as "biological" in any 
sense, rather than just a simple and a priori reasonable modification of some previously 
deployed system. 



3.4.3 Artificial Life 

The two main objectives of Artificial Life, closely related to one another, are
understanding the abstract functioning and especially the origin of terrestrial life, and
creating organisms that can meaningfully be called "alive" [163].

The first objective involves formalizing and abstracting the mechanical processes
underpinning terrestrial life. In particular, much of this work involves various degrees of
abstraction of the process of self-replication. Some of the more real-world-oriented work
on this topic involves investigating how lipids assemble into more complex structures such
as vesicles and membranes, which is one of the fundamental questions concerning the origin
of life [6^, 77, 205, 212, 201]. Many computer models have been proposed to simulate this
process, though most suffer from oversimplifying the molecular morphology.

More generally, work concerned with the origin of life can constitute an investigation of
the functional self-organization that gives rise to life [18C]. In this regard, an important
early work on functional self-organization is the lambda calculus, which provides an
elegant framework (recursively defined functions, lack of distinction between object and
function, lack of architectural restrictions) for studying computational systems [56]. This
framework can be used to develop an artificial chemistry "function gas" that displays
complex cooperative properties [p^j].

The second objective of the field of Artificial Life is less concerned with understanding
the details of terrestrial life per se than with using terrestrial life as inspiration for
how to design living systems. For example, motivated by the existence (and persistence)
of computer viruses, several workers have tried to design an immune system for computers
that will develop "antibodies" and handle viruses both more rapidly and more efficiently
than other algorithms [145, 248]. More generally, because we only have one sampling point
(life on Earth), it is very difficult to precisely formulate the process by which life
emerged. By creating an artificial world inside a computer, however, it is possible to
study far more general forms of life [217, 218, 219]. See also [i&], where the argument is
presented that the richest way of approaching the issue of defining "life" is
phenomenologically, in terms of self-dissimilar scaling properties of the system.



3.4.4 Training cellular automata with genetic algorithms 



Cellular automata can be viewed as digital abstractions of physical gases [pq, pq, 277,
278]. Formally, they are discrete-time recurrent neural nets where the neurons live on a
grid, each neuron has a finite number of potential states, and inter-neuron connections
are (usually) purely local. (See below for a discussion of recurrent neural nets.) So the
state update rule of each neuron is fixed and local, the next state of a neuron being a
function of the current states of that neuron and of its neighboring elements.
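
As an illustration of such a fixed, local update rule, the following minimal sketch updates a one-dimensional ring of binary cells. The function names (`step`, `majority`, `run`), the ring topology, and the choice of a majority rule are assumptions made for this example only; they are not taken from the works cited above.

```python
from typing import Callable, List

# Hypothetical type for a local rule: (left, centre, right) -> next state.
LocalRule = Callable[[int, int, int], int]


def step(cells: List[int], rule: LocalRule) -> List[int]:
    """Apply one synchronous update to a ring of cells (assumed topology)."""
    n = len(cells)
    return [rule(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])
            for i in range(n)]


def majority(left: int, centre: int, right: int) -> int:
    """Illustrative rule: adopt the majority state of the 3-cell neighbourhood."""
    return 1 if left + centre + right >= 2 else 0


def run(cells: List[int], rule: LocalRule, steps: int) -> List[int]:
    """Iterate the same fixed rule for a given number of time steps."""
    for _ in range(steps):
        cells = step(cells, rule)
    return cells
```

Starting from `[0, 1, 1, 0, 0, 0, 1, 0]`, the isolated 1 is erased in the first step and the remaining block of two 1s is then left unchanged, so this particular configuration relaxes to a fixed point.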

The state update rule of (all the neurons making up) any particular cellular automaton 
specifies the mapping taking the initial configuration of the states of all of its neurons 
to the final, equilibrium (perhaps strange) attractor configuration of all those neurons. 
So consider the situation where we have a desired such mapping, and want to know an 
update rule that induces that mapping. This is a search problem, and can be viewed as 
similar to the inverse problem of how to design a COIN to achieve a pre-specified global 
goal, albeit a "COIN" whose nodal elements do not use RL algorithms. 

Genetic algorithms are a special kind of search algorithm, based on analogy with the
biological process of natural selection via recombination and mutation of a genome [190].
Although genetic algorithms (and 'evolutionary computation' in general) have been studied
quite extensively, there is no formal theory justifying genetic algorithms as search
algorithms [172, 285] and few empirical comparisons with other search techniques. One
example of a well-studied application of genetic algorithms is to (try to) solve the
inverse problem of finding update rules for a cellular automaton that induce a pre-specified
mapping from its initial configuration to its attractor configuration. To date, they have
been used this way only for extremely simple configuration mappings, mappings which can be
trivially learned by other kinds of systems. Despite the simplicity of these mappings, the
use of genetic algorithms to try to train cellular automata to exhibit them has achieved
little success [M, 62, 191, 192].
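
To make the shape of this inverse problem concrete, the toy sketch below searches over the 8-entry rule tables of a three-cell-neighbourhood automaton on a ring, scoring each candidate rule by how well it reproduces a set of target configurations. Everything here is an assumption of the example (the rule encoding, the fitness measure, truncation selection with one elite, one-point crossover, the parameter values, and the names `evolve_ca`, `fitness`, `genetic_search`); it is not the procedure used in the cited studies and says nothing about how well such a search performs on harder mappings.

```python
import random
from typing import List, Tuple

Rule = List[int]  # 8 output bits, indexed by 4*left + 2*centre + right


def evolve_ca(cells: List[int], rule: Rule, steps: int) -> List[int]:
    """Run a binary ring automaton for a fixed number of steps."""
    n = len(cells)
    for _ in range(steps):
        cells = [rule[4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n]]
                 for i in range(n)]
    return cells


def fitness(rule: Rule, samples, targets, steps: int) -> float:
    """Fraction of cells that match the desired final configurations."""
    matches = total = 0
    for start, target in zip(samples, targets):
        out = evolve_ca(start, rule, steps)
        matches += sum(a == b for a, b in zip(out, target))
        total += len(target)
    return matches / total


def genetic_search(samples, targets, steps: int, pop_size: int = 20,
                   generations: int = 30, p_mut: float = 0.1,
                   seed: int = 0) -> Tuple[Rule, List[float]]:
    """Toy genetic algorithm; returns the best rule and per-generation best fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda r: fitness(r, samples, targets, steps), reverse=True)
        history.append(fitness(pop[0], samples, targets, steps))
        parents = pop[: pop_size // 2]          # truncation selection
        children = [pop[0][:]]                  # keep the current best (elitism)
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, 7)             # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
    best = max(pop, key=lambda r: fitness(r, samples, targets, steps))
    return best, history
```

Because the best rule of each generation is copied unchanged into the next, the recorded best fitness cannot decrease; whether it ever reaches 1.0 depends on the target mapping and on the random seed.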



3.5 Physics-Based Systems 
3.5.1 Statistical Physics 

Equilibrium statistical physics is concerned with the stable state character of large
numbers of very simple physical objects, interacting according to well-specified local
deterministic laws, with probabilistic noise processes superimposed [||, 220]. Typically there is
no sense in which such systems can be said to have centralized control, since all particles 
contribute comparably to the overall dynamics. 

Aside from mesoscopic statistical physics, the numbers of particles considered are
usually huge (e.g., 10^^), and the particles themselves are extraordinarily simple, typically
having only a few degrees of freedom. Moreover, the noise processes usually considered
are highly restricted, being those that are formed by "baths" of heat, particles, and the
like. Similarly, almost all of the field restricts itself to deterministic laws that are
readily encapsulated in Hamilton's equations (Schrödinger's equation and its field-theoretic
variants for quantum statistical physics). In fact, much of equilibrium statistical physics
isn't even concerned with the dynamic laws by themselves (as, for example, is the study of
stochastic Markov processes). Rather it is concerned with invariants of those laws (e.g.,
energy), invariants that relate the states of all of the particles. Trivially then,
deterministic laws without such readily-discoverable invariants are outside of the purview
of much of statistical physics.

One potential use of statistical physics for COINs involves taking the systems that
statistical physics analyzes, especially those analyzed in its condensed matter variant
(e.g., spin glasses [p52, 253]), as simplified models of a class of COINs. This approach
is used in some of the analysis of the Bar problem (see above). It is used more overtly
in (for example) the work of Galam [104], in which the equilibrium coalitions of a set of
"countries" are modeled in terms of spin glasses. This approach cannot provide a general
COIN framework though. In addition to the restrictions listed above on the kinds of
systems it considers, this is due to its not providing a general solution to arbitrary COIN
inversion problems, and to its not employing RL algorithms.^

Another contribution that statistical physics can make is with the mathematical
techniques it has developed for its own purposes, like mean field theory, self-averaging
approximations, phase transitions, Monte Carlo techniques, the replica trick, and tools to
analyze the thermodynamic limit in which the number of particles goes to infinity.
Although such techniques have not yet been applied to COINs, they have been successfully
applied to related fields. This is exemplified by the use of the replica trick to analyze
two-player zero-sum games with random payoff matrices in the thermodynamic limit of
the number of strategies in [28]. Other examples are the numeric investigation of the
iterated prisoner's dilemma played on a lattice [26l], the analysis of stochastic games by
expressing deviation from rationality in the form of a "heat bath" [17(:], and the use
of topological entropy to quantify the complexity of a voting system studied in [182].

Other quite recent work in the statistical physics literature is formally identical to that
in other fields, but presents it from a novel perspective. A good example of this is [246],
which is concerned with the problem of controlling a spatially extended system with a
single controller, by using an algorithm that is identical to a simple-minded proportional
RL algorithm (in essence, a rediscovery of RL).

^In regard to the latter point however, it's interesting to speculate about recasting statistical physics
as a COIN, by viewing each of the particles in the physical system as running an "RL algorithm" that
perfectly optimizes the "utility function" of its Lagrangian, given the "actions" of the other particles.
In this perspective, many-particle physical systems are multi-stage games that are at Nash equilibrium
in each stage. So for example, a frustrated spin glass is such a system at a Nash equilibrium that is not
Pareto optimal.



3.5.2 Action Extremization 



Much of the theory of physics can be cast as solving for the extremization of an actional, 
which is a functional of the worldline of an entire (potentially many-component) sys- 
tem across all time. The solution to that extremization problem constitutes the actual 
worldline followed by the system. In this way the calculus of variations can be used to 
solve for the worldline of a dynamic system. As an example, simple Newtonian dynamics
can be cast as solving for the worldline of the system that extremizes the time integral of
a quantity called the 'Lagrangian', which is a function of the state along that worldline
and of certain parameters (e.g., the 'potential energy') governing the system at hand. In
this instance, the calculus of variations simply results in Newton's laws.
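
For concreteness, the Newtonian example can be written out for a single particle in one dimension. The notation below ($S$ for the actional, $x(t)$ for the worldline, $m$ for the mass, $V$ for the potential energy, and the endpoints $t_0$, $t_1$) is assumed for this sketch and does not appear in the text above.

```latex
% Assumed notation: x(t) is the worldline, m the mass, V the potential energy.
S[x] = \int_{t_0}^{t_1} L\bigl(x(t), \dot{x}(t)\bigr)\, dt,
\qquad
L(x, \dot{x}) = \tfrac{1}{2} m \dot{x}^{2} - V(x).

% Requiring \delta S = 0 for variations that vanish at t_0 and t_1 gives the
% Euler-Lagrange equation, which here reduces to Newton's second law:
\frac{d}{dt}\frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x} = 0
\quad\Longrightarrow\quad
m \ddot{x} = -\frac{dV}{dx}.
```

Changing the function $V$ changes the functional being extremized and hence the resulting worldline, which is the sense in which physical parameters of the system fix its implicit "global goal".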

If we take the dynamic system to be a COIN, we are assured that its worldline 
automatically optimizes a "global goal" consisting of the value of the associated actional. 
If we change physical aspects of the system that determine the functional form of the 
actional (e.g., change the system's potential energy function), then we change the global
goal, and we are assured that our COIN optimizes that new global goal. Counter-intuitive 
physical systems, like those that exhibit Braess' paradox [^], are simply systems for 
which the "world utility" implicit in our human intuition is extremized at a point different 
from the one that extremizes the system's actional. 

The challenge in exploiting this to solve the COIN design problem is in translating 
an arbitrary provided global goal for the COIN into a parameterized actional. Note that 
that actional must govern the dynamics of the physical COIN, and the parameters of the 
actional must be physical variables in the COIN, variables whose values we can modify. 



3.5.3 Active Walker Models 



The field of active walker models [117, 118] is concerned with modeling "walkers" (be
they human walkers or instead simple physical objects) crossing fields along trajectories, 
where those trajectories are a function of several factors, including in particular the 
trails already worn into the field. Often the kind of trajectories considered are those 
that can be cast as solutions to actional extremization problems so that the walkers can 
be explicitly viewed as agents optimizing a private utility. 

One of the primary concerns of the field of active walker models is how the trails
worn in the field change with time to reach a final equilibrium state. The problem 
of how to design the cement pathways in the field (and other physical features of the 
field) so that the final paths actually followed by the walkers will have certain desirable 
characteristics is then one of solving for parameters of the actional that will result in the 
desired worldline. This is a special instance of the inverse problem of how to design a 
COIN. 

Using active walker models this way to design COINs, like action extremization in
general, probably has limited applicability. Also, it is not clear how robust such a design 
approach might be, or whether it would be scalable and exempt from the need for hand- 
tailoring. 



3.6 Other Related Subjects 

This subsection presents a "catch-all" of other fields that have little in common with one 
another except that they bear some relation to COINs. 



3.6.1 Stochastic Fields 



An extremely well-researched body of work concerns the mathematical and numeric
behavior of systems for which the probability distribution over possible future states
conditioned on preceding states is explicitly provided. This work involves many aspects
of Monte Carlo numerical algorithms [POO], all of Markov chains [204, p54], and
especially Markov fields, a topic that encompasses the Chapman-Kolmogorov equations
[105] and their variants: Liouville's equation, the Fokker-Planck equation, and the
detailed-balance equation in particular. Nonlinear dynamics is also related to this body of
work (see the synopsis of iterated function systems below and the synopsis of cellular
automata above), as are Markov competitive decision processes (see the synopsis of game
theory above).

Formally, one can cast the problem of designing a COIN as how to fix each of the 
conditional transition probability distributions of the individual elements of a stochastic 
field so that the aggregate behavior of the overall system is of a desired form.^ Unfor- 
tunately, almost all that is known in this area instead concerns the forward problem, of 
inferring aggregate behavior from a provided set of conditional distributions. Although 
such knowledge provides many "bits and pieces" of information about how to tackle the 
inverse problem, those pieces collectively cover only a very small subset of the entire 
space of tasks we might want the COIN to perform. In particular, they tell us very little 
about the case where the conditional distribution encapsulates RL algorithms. 
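
The forward problem mentioned above can be illustrated with the smallest possible case, a single finite Markov chain whose transition probabilities are given and whose long-run behavior is to be inferred. The sketch below is only an example under stated assumptions: the function names and the use of simple repeated propagation (rather than solving an eigenvalue problem) are choices made here, and the method presumes the chain actually converges to a unique stationary distribution.

```python
from typing import List

Matrix = List[List[float]]  # transition[i][j] = P(next state = j | current state = i)


def step_distribution(p: List[float], transition: Matrix) -> List[float]:
    """Propagate a probability distribution over states forward by one step."""
    n = len(p)
    return [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]


def stationary(transition: Matrix, iters: int = 1000) -> List[float]:
    """Approximate the long-run distribution by repeated propagation.

    Assumes the chain converges to a unique stationary distribution.
    """
    n = len(transition)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = step_distribution(p, transition)
    return p
```

For the two-state chain `[[0.9, 0.1], [0.5, 0.5]]` this gives approximately (5/6, 1/6). The inverse problem would instead start from a desired long-run distribution and ask which transition probabilities to choose, and in general that choice is not unique.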



^In contrast, in the field of Markov decision processes, discussed in [g8], the full system may be a
Markov field, but the system designer only sets the conditional transition probability distribution of a
few of the field elements at most, to the appropriate "decision rules". Unfortunately, it is hard to imagine
how to use the results of this field to design COINs because of major scaling problems. Any decision
process must accurately model likely future modifications to its own behavior, often an extremely
daunting task [173]. What's worse, if multiple such decision processes are running concurrently in the
system, each such process must also model the others, potentially needing to model them in their full
complexity.



3.6.2 Iterated Function Systems



The technique of iterated function systems [19] grew out of the field of nonlinear
dynamics [p23, 256, 262]. In such systems a function is repeatedly and recursively applied
to its own output. The most famous example is the logistic map, $x_{n+1} = r x_n (1 - x_n)$
for some $r$ between 0 and 4 (so that $x$ stays between 0 and 1). More generally the
function along with its arguments can be vector-valued. In particular, we can construct
such functions out of affine transformations of points in a Euclidean plane.
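
Both kinds of iteration can be sketched in a few lines. In the example below the function names, the six-number encoding of an affine map, and the particular three maps (which generate a Sierpinski-like triangle) are illustrative assumptions; randomly choosing which map to apply at each step is one common way of sampling the limiting set and is not taken from the text above.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Affine = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)


def logistic_orbit(r: float, x0: float, n: int) -> List[float]:
    """Iterate x -> r * x * (1 - x) n times, returning the whole orbit."""
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def apply_affine(m: Affine, p: Point) -> Point:
    """Apply (x, y) -> (a*x + b*y + e, c*x + d*y + f)."""
    a, b, c, d, e, f = m
    x, y = p
    return (a * x + b * y + e, c * x + d * y + f)


def random_iteration(maps: Sequence[Affine], n: int, seed: int = 0) -> List[Point]:
    """Repeatedly apply a randomly chosen map to the previous point."""
    rng = random.Random(seed)
    p: Point = (0.0, 0.0)
    points = []
    for _ in range(n):
        p = apply_affine(rng.choice(maps), p)
        points.append(p)
    return points


# Three contractions of the unit square (illustrative choice of parameters).
TRIANGLE_MAPS: List[Affine] = [
    (0.5, 0.0, 0.0, 0.5, 0.0, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.5, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.25, 0.5),
]
```

With `r = 2` the logistic orbit started at 0.2 settles on the fixed point 0.5, while values of `r` near 4 give orbits that remain in the unit interval without settling. The inverse step discussed in the text corresponds to choosing the six numbers of each map so that the generated point set approximates a given image.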

Iterated function systems have been applied to image data. In this case the successive
iteration of the function generically generates a fractal, one whose precise character
is determined by the initial iteration-1 image. Since fractals are ubiquitous in natural
images, a natural idea is to try to encode natural images as sets of iterated function 
systems spread across the plane, thereby potentially garnering significant image com- 
pression. The trick is to manage the inverse step of starting with the image to be 
compressed, and determining what iteration-1 image(s) and iterating function(s) will 
generate an accurate approximation of that image. 

In the language of nonlinear dynamics, we have a dynamic system that consists of a 
set of iterating functions, together with a desired attractor (the image to be compressed).
Our goal is to determine what values to set certain parameters of our dynamic system 
to so that the system will have that desired attractor. The potential relationship with 
COINs arises from this inverse nature of the problem tackled by iterated function sys- 
tems. If the goal for a COIN can be cast as its relaxing to a particular attractor, and 
if the distributed computational elements are isomorphic to iterated functions, then the 
tricks used in iterated function theory could be of use.

Although the techniques of iterated function systems might prove of use in designing 
COINs, they are unlikely to serve as a generally applicable approach to designing COINs. 
In addition, they do not involve RL algorithms, and often involve extensive hand-tuning. 

3.6.3 Recurrent Neural Nets 

A recurrent neural net consists of a finite set of "neurons" each of which has a real-valued
state at each moment in time. Each neuron's state is updated at each moment in time
based on its current state and that of some of the other neurons in the system. The
topology of such dependencies constitutes the "inter-neuronal connections" of the net,
and the associated parameters are often called the "weights" of the net. The dynamics
can be either discrete or continuous (i.e., given by difference or differential equations).



Recurrent nets have been investigated for many purposes [107, 125, plO, 293]. One of
the more famous of these is associative memories. The idea is that given a pre-specified
pattern for the (states of the neurons in the) net, there may exist inter-neuronal weights
which result in a basin of attraction focussed on that pattern. If this is the case, then the
net is equivalent to an associative memory, in that a complete pre-specified pattern across
all neurons will emerge under the net's dynamics from any initial pattern that partially
matches the full pre-specified pattern. In practice, one wishes the net to simultaneously
possess many such pre-specified associative memories. There are many schemes for
"training" a recurrent net to have this property, including schemes based on spin glasses
[125, 126, 127] and schemes based on gradient descent [224].
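
A minimal sketch of such an associative memory follows. It makes several simplifying assumptions that are not taken from the text: neuron states are restricted to +1 and -1 rather than being real-valued, the weights are set by a simple Hebbian outer-product rule rather than by any particular scheme cited above, and neurons are updated one at a time in a fixed order. The names `train_hebbian` and `recall` are hypothetical.

```python
from typing import List


def train_hebbian(patterns: List[List[int]]) -> List[List[float]]:
    """Set symmetric weights from +1/-1 patterns (assumed Hebbian rule)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:                      # no self-connections
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w: List[List[float]], state: List[int], sweeps: int = 10) -> List[int]:
    """Update neurons one at a time until no neuron changes (or sweeps run out)."""
    s = list(state)
    n = len(s)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            field = sum(w[i][j] * s[j] for j in range(n))
            new = 1 if field >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s
```

With a single stored pattern of eight neurons, an initial state that differs from the pattern in two positions is pulled back to the stored pattern. Storing many patterns at once is where the training difficulties noted in the text arise.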

Like the fields of cellular automata and iterated function systems, the field of
recurrent neural nets can be viewed as concerning certain variants of COINs. Also like
those other fields though, recurrent neural nets have shortcomings if one tries to view them
as a general approach to a science of COINs. In particular, recurrent neural nets do
not involve RL algorithms, and training them often suffers from scaling problems. More 
generally, in practice they can be hard to train well without hand-tailoring. 

3.6.4 Network Theory 



Packet routing in a data network [29, 129, 24£, 27C, 147, IIC] presents a particularly
interesting domain for the investigation of COINs. In particular, with such routing:

(i) the problem is inherently distributed; 

(ii) for all but the most trivial networks it is impossible to employ global control;

(iii) the routers have access only to local information (routing tables);

(iv) it constitutes a relatively clean and easily modified experimental testbed; and 

(v) there are potentially major bottlenecks induced by 'greedy' behavior on the part of 
the individual routers, which behavior constitutes a readily investigated instance of the 
Tragedy Of the Commons (TOC). 



Many of the approaches to packet routing incorporate a variant on RL [41, 46, 54,
169, 175]. Q-routing is perhaps the best known such approach and is based on routers
using reinforcement learning to select the best path [^l]. Although generally successful,
Q-routing is not a general scheme for inverting a global task. This is even true if one 
restricts attention to the problem of routing in data networks — there exists a global 
task in such problems, but that task is directly used to construct the algorithm. 
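
The flavor of such a scheme can be conveyed by a small sketch in which each router keeps, for every destination and every neighbor, an estimate of the remaining delivery time and nudges that estimate toward what the chosen neighbor reports. This follows the commonly described form of the Q-routing update; the class name `QRouter`, its methods, the zero initial estimates, and the purely greedy choice of next hop are assumptions of this example, and practical variants differ in initialization and exploration.

```python
from typing import Dict, Hashable, Iterable, List

Node = Hashable


class QRouter:
    """One router's local table of estimated delivery times (hypothetical API)."""

    def __init__(self, name: Node, neighbours: Iterable[Node],
                 learning_rate: float = 0.5) -> None:
        self.name = name
        self.neighbours: List[Node] = list(neighbours)
        self.learning_rate = learning_rate
        self.q: Dict[Node, Dict[Node, float]] = {}  # q[dest][neighbour]

    def estimate(self, dest: Node, neighbour: Node) -> float:
        """Estimated time to deliver to dest via neighbour (0.0 if unseen)."""
        return self.q.setdefault(dest, {}).get(neighbour, 0.0)

    def best_estimate(self, dest: Node) -> float:
        """What this router reports back to an upstream sender."""
        if dest == self.name:
            return 0.0
        return min(self.estimate(dest, y) for y in self.neighbours)

    def choose(self, dest: Node) -> Node:
        """Greedy next hop: the neighbour with the lowest current estimate."""
        return min(self.neighbours, key=lambda y: self.estimate(dest, y))

    def update(self, dest: Node, neighbour: Node, queue_delay: float,
               link_delay: float, neighbour_estimate: float) -> float:
        """Move the estimate toward local delay plus the neighbour's report."""
        old = self.estimate(dest, neighbour)
        target = queue_delay + link_delay + neighbour_estimate
        new = old + self.learning_rate * (target - old)
        self.q[dest][neighbour] = new
        return new
```

Each router uses only its own table and the single number reported by the neighbor it just used, which matches the locality constraint in item (iii) above. Note that the quantity being learned (delivery time) is written directly into the update, which is the sense in which the global task is "directly used to construct the algorithm".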

A particular version of the general packet routing problem that is acquiring increased 
attention is the Quality of Service (QoS) problem, where different communication pack- 
ets (voice, video, data) share the same bandwidth resource but have widely varying 
importances both to the user and (via revenue) to the bandwidth provider. Determining 
which packet has precedence over which other packets in such cases is not only based 
on priority in arrival time but more generally on the potential effects on the income of 
the bandwidth provider. In this context, RL algorithms have been used to determine
routing policy, control call admission, and maximize revenue by allocating the available
bandwidth efficiently [46, 175].



Many researchers have exploited the noncooperative game theoretic understanding
of the TOC in order to explain the bottleneck character of empirical data networks'
behavior and suggest potential alternatives to current routing schemes [pq, 76, 153, 154,
162, p^, |0|, lo, 243]. Closely related is work on various "pricing"-based resource
allocation strategies in congestible data networks [171]. This work is at least partially
based upon current understanding of pricing in toll lanes, and traffic flow in general (see
below). All of these approaches are particularly of interest when combined with the
RL-based schemes mentioned just above. Due to these factors, much of the current research
on a general framework for COINs is directed toward the packet-routing domain (see
next section).



3.6.5 Traffic Theory 

Traffic congestion typifies the TOC public good problem: everyone wants to use the
same resource, and all parties greedily trying to optimize their use of that resource not
only worsens global behavior, but also worsens their own private utility (e.g., if everyone
disobeys traffic lights, everyone gets stuck in traffic jams). Indeed, in the well-known
Braess' paradox [55, 5^, 155], keeping everything else constant (including the number and
destinations of the drivers) but opening a new traffic path can increase everyone's time
to get to their destination. (Viewing the overall system as an instance of the Prisoner's
dilemma, this paradox in essence arises through the creation of a novel 'defect-defect'
option for the overall system.) Greedy behavior on the part of individuals also results
in very rich global dynamic patterns, such as stop and go waves and clusters [l20].
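
The paradox can be checked numerically on the standard four-node textbook network. All numbers below (4000 drivers, a congestible link costing load/100 minutes, a fixed link costing 45 minutes, and a free shortcut) are assumptions of this illustration rather than values from the cited works, and the function name `route_times` is hypothetical. Drivers are treated as a continuum, so the effect of a single driver switching routes on the link loads is ignored.

```python
from typing import Dict


def route_times(n_top: float, n_bottom: float, n_shortcut: float,
                scale: float = 100.0, fixed: float = 45.0) -> Dict[str, float]:
    """Travel time (minutes) on each route, given how many drivers use each.

    Routes (assumed network): top      = S -> A -> E
                              bottom   = S -> B -> E
                              shortcut = S -> A -> B -> E, with a free A -> B link
    Links S->A and B->E are congestible (load / scale); A->E and S->B are fixed.
    """
    t_sa = (n_top + n_shortcut) / scale       # shared by top and shortcut
    t_be = (n_bottom + n_shortcut) / scale    # shared by bottom and shortcut
    return {"top": t_sa + fixed,
            "bottom": fixed + t_be,
            "shortcut": t_sa + t_be}
```

Without the shortcut, an even split gives `route_times(2000, 2000, 0)`, with 65 minutes on both available routes. Once the shortcut exists it would take only 40 minutes at those loads, so the even split is no longer stable. When all 4000 drivers use the shortcut, `route_times(0, 0, 4000)` gives 80 minutes for the shortcut and 85 minutes for either alternative, so no driver gains by switching even though everyone is 15 minutes worse off than before the new path was opened.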

Much of traffic theory employs and investigates tools that have previously been applied
in statistical physics [119, 150, 151, pll, 21(] (see subsection above). In particular,
the spontaneous formation of traffic jams provides a rich testbed for studying the
emergence of complex activity from seemingly chaotic states [119, 121]. Furthermore, the
dynamics of traffic flow is particularly amenable to the application and testing of many
novel numerical methods in a controlled environment [16, 30, 237]. Many experimental
studies have confirmed the usefulness of applying insights gleaned from such work to
real-world traffic scenarios [119, 198, 197].



3.6.6 Topics from further afield 

Finally, there are a number of other fields that, while either still nascent or not extremely
closely related to COINs, are of interest in COIN design: 

Amorphous computing: Amorphous computing grew out of the idea of replacing
traditional computer design, with its requirements for high reliability of the components
of the computer, with a novel approach in which widespread unreliability of those
components would not interfere with the computation [Q]. Some of its more speculative
aspects are concerned with "how to program" a massively distributed, noisy system
of components which may consist in part of biochemical and/or biomechanical
components [152, 274]. Work here has tended to focus on schemes for how to robustly induce
desired geometric dynamics across the physical body of the amorphous computer, issues
that are closely related to morphogenesis, and thereby lend credence to the idea that
biochemical components are a promising approach. Especially in its limit of computers
with very small constituent components, amorphous computing also is closely related to
the fields of nanotechnology [71] and control of smart matter (see below).

Control of smart matter: As the prospect of nanotechnology-driven mechanical
systems gets more concrete, the daunting problem of how to robustly control, power,
and sustain protean systems made up of extremely large sets of nano-scale devices looms
more important [111, 112, 122(]. If this problem were to be solved one would in essence
have "smart matter". For example, one would be able to "paint" an airplane wing with
such matter and have it improve drag and lift properties significantly.

Morphogenesis: How does a leopard embryo get its spots, or a zebra embryo its 
stripes? More generally, what are the processes underlying morphogenesis, in which 
a body plan develops among a growing set of initially undifferentiated cells? These 
questions, related to control of the dynamics of chemical reaction waves, are essentially 
special cases of the more general question of how ontogeny works, of how the
genotype-phenotype mapping is carried out in development. The answers involve homeobox (as
well as many other) genes [l^, 144, 266]. Under the presumption that the
functioning of such genes is at least in part designed to facilitate genetic changes that
increase a species' fitness, that functioning facilitates solution of the inverse problem, of 
finding small-scale changes (to DNA) that will result in "desired" large scale effects (to 
body plan) when propagated across a growing distributed system. 

Self-organizing systems: The concept of self-organization and self-organized
criticality [15] was originally developed to help understand why many distributed physical
systems are attracted to critical states that possess long-range dynamic correlations in
the large-scale characteristics of the system. It provides a powerful framework for
analyzing both biological and economic systems. For example, natural selection (particularly
punctuated equilibrium [78, 109]) can be likened to a self-organizing dynamical system,
and some have argued it shares many of the properties (e.g., scale invariance) of such
systems [63]. Similarly, one can view the economic order that results from the actions of
human agents as a case of self-organization [^5]. The relationship between complexity
and self-organization is a particularly important one, in that it provides the potential
laws that allow order to arise from chaos [114].



Small worlds (6 Degrees of Separation): In many distributed systems where each 
component can interact with a small number of "neighbors" , an important problem is how 
to propagate information across the system quickly and with minimal overhead. On the 
one extreme the neighborhood topology of such systems can exist on a completely regular 
grid-like structure. On the other, the topology can be totally random. In either case, 
certain nodes may be effectively 'cut-off' from other nodes if the information pathways 
between them are too long. Recent work has investigated "small worlds" networks 
(sometimes called 6 degrees of separation) in which underlying grid-like topologies are 
"doped" with a scattering of long-range, random connections. It turns out that very little 
such doping is necessary to allow the system to effectively circumvent the information 
propagation problem [183, 273]. 
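
The doping effect described above can be illustrated with a minimal sketch. It assumes a ring lattice as the regular topology and adds a handful of random shortcuts; the function names (ring_lattice, dope, mean_path_length) and all sizes are hypothetical choices for illustration, not taken from the cited work.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Regular ring: each node is linked to its k nearest neighbours on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def dope(adj, n_shortcuts, rng):
    """Return a copy of the topology with a few random long-range links added."""
    n = len(adj)
    doped = {i: set(nb) for i, nb in adj.items()}
    added = 0
    while added < n_shortcuts:
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b and b not in doped[a]:
            doped[a].add(b)
            doped[b].add(a)
            added += 1
    return doped


def mean_path_length(adj):
    """Average shortest-path length over all ordered node pairs (breadth-first search)."""
    n = len(adj)
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))


rng = random.Random(0)
regular = ring_lattice(200, 2)
small_world = dope(regular, 10, rng)
print(mean_path_length(regular), mean_path_length(small_world))
```

Only ten extra links are added to a 200-node ring here, yet every shortcut joins two nodes that were previously at least two hops apart, so the mean path length of the doped network is strictly smaller than that of the regular one.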



Control theory: Adaptive control p, 231], and in particular adaptive control involv- 
ing locally weighted RL algorithms [8, 194], constitute a broadly applicable framework 
for controlling small, potentially inexactly modeled systems. Augmented by techniques 
in the control of chaotic systems [^, they constitute a very successful way of 
solving the "inverse problem" for such systems. Unfortunately, it is not clear how one 
could even attempt to scale such techniques up to the massively distributed systems 
of interest in COINs. The next section discusses in detail some of the underlying rea- 
sons why the purely model-based versions of these approaches are inappropriate as a 
framework for COINs. 



4 A FRAMEWORK DESIGNED FOR COINs 

Summarizing the discussion to this point, it is hard to see how any already extant 
scientific field can be modified to encompass systems meeting all of the requirements of 
COINs listed at the beginning of Section 3. This is not too surprising, since none of those 
fields were explicitly designed to analyze COINs. This section first motivates in general 
terms a framework that is explicitly designed for analyzing COINs. It then presents the 
formal nomenclature of that framework. This is followed by derivations of some of the 
central theorems of that framework.^ Finally, we present experiments that illustrate the 
power the framework provides for ensuring large world utility in a COIN. 



4.1 Problems with a model-based approach 

What mathematics might one employ to understand and design COINs? Perhaps the 
most natural approach, related to the stochastic fields work reviewed above, involves the 
following three steps: 

1) First one constructs a detailed stochastic model of the COIN's dynamics, a model 
parameterized by a vector $\theta$. As an example, $\theta$ could fix the utility functions of the 
individual agents of the COIN, aspects of their RL algorithms, which agents communicate 
with each other and how, etc. 

2) Next we solve for the function $f(\theta)$ which maps the parameters of the model to 
the resulting stochastic dynamics. 

3) Cast our goal for the system as a whole as achieving a high expected value of some 
"world utility". Then as our final step we would have to solve the inverse problem: we 
would have to search for a $\theta$ which, via $f$, results in a high value of $E(\text{world utility} \mid \theta)$. 

^ A much more detailed discussion, including intuitive arguments, proofs and fully formal definitions 
of the concepts discussed in this section, can be found in [283]. 
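
The three steps listed above can be sketched on a deliberately tiny system. Everything in this sketch is an assumption made for illustration: ten agents each pick one of two resources independently with probability theta, the toy world utility rewards an even split, and the inverse problem is solved by brute-force grid search. Real COINs are far too complex for any of this to be tractable, which is the point the text goes on to make.

```python
from math import comb

N = 10  # number of agents in the hypothetical toy system


def world_utility(n_a):
    """Toy world utility: highest when the agents split evenly over two resources."""
    return -(n_a - N / 2) ** 2


def f(theta):
    """Step 2: map the model parameter theta to the induced outcome distribution.

    Step 1 (the model) is the assumption that each agent independently picks
    resource A with probability theta, so the count on A is binomial.
    """
    return [comb(N, k) * theta**k * (1 - theta) ** (N - k) for k in range(N + 1)]


def expected_world_utility(theta):
    """E(world utility | theta) under the model above."""
    return sum(p * world_utility(k) for k, p in enumerate(f(theta)))


# Step 3: the inverse problem, solved here by exhaustive search over a grid of theta.
grid = [i / 100 for i in range(101)]
best_theta = max(grid, key=expected_world_utility)
print(best_theta, expected_world_utility(best_theta))
```

Even in this toy case the search needs a closed-form f; with many interacting learners no such closed form is normally available.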
Let's examine in turn some of the challenges each of these three steps entails: 

I) We are primarily interested in very large, very complex systems, which are noisy, 
faulty, and often operate in a non-stationary environment. Moreover, our "very complex 
system" consists of many RL algorithms, all potentially quite complicated, all running 
simultaneously. Clearly coming up with a detailed model that captures the dynamics 
of all of this in an accurate manner will often be extraordinarily difficult. Moreover, 
unfortunately, given that the modeling is highly detailed, often the level of verisimilitude 
required of the model will be quite high. For example, unless the modeling of the faulty 
aspects of the system were quite accurate, the model would likely be "brittle" , and overly 
sensitive to which elements of the COIN were and were not operating properly at any 
given time. 

II) Even for models much simpler than the ones called for in (I), solving explicitly for 
the function $f$ can be extremely difficult. For example, much of Markov chain theory 
is an attempt to broadly characterize such mappings. However as a practical matter, 
usually it can only produce potentially useful characterizations when the underlying 
models are quite inaccurate simplifications of the kinds of models produced in step (I). 

III) Even if one can write down an $f$, solving the associated inverse problem is often 
impossible in practice. 

IV) In addition to these difficulties, there is a more general problem with the model- 
based approach. We wish to perform our analysis on a "high level" . Our thesis is that 
due to the robust and adaptive nature of the individual agents' RL algorithms, there 
will be very broad, easily identifiable regions of $\theta$ space all of which result in excellent 
$E(\text{world utility} \mid \theta)$, and that these regions will not depend on the precise learning 
algorithms used to achieve the low-level tasks (cf. the list at the beginning of Section 3). 
To fully capitalize on this, one would want to be able to slot in and out different learning 
algorithms for achieving the low-level tasks without having to redo our entire analysis 
each time. However in general this would be possible with a model-based analysis only 
for very carefully designed models (if at all). The problem is that the result of step 
(3), the solution to the inverse problem, would have to concern aspects of the COIN 
that are (at least approximately) invariant with respect to the precise low-level learning 
algorithms used. Coming up with a model that has this property while still avoiding 
problems (I-III) is usually an extremely daunting challenge. 

Fortunately, there is an alternative approach which avoids the difficulties of detailed 
modeling. Little modeling of any sort is ever used in this alternative, and what modeling 
does arise has little to do with dynamics. In addition, any such modeling is extremely 
high-level, intended to serve as a decent approximation to almost any system having 
"reasonable" RL algorithms, rather than as an accurate model of one particular system. 

We call any framework based on this alternative a descriptive framework. In 
such a framework one identifies certain salient characteristics of COINs, which are 
characteristics of a COIN's entire worldline that one strongly expects to find in COINs 
that have large world utility. Under this expectation, one makes the assumption that if 
a COIN is explicitly modified to have the salient characteristics (for example in response 
to observations of its run-time behavior), then its world utility will benefit. So long as 
the salient characteristics are (relatively) easy to induce in a COIN, then this assumption 
provides a ready indirect way to cause that COIN to have large world utility. 

An assumption of this nature is the central leverage point that a descriptive frame- 
work employs to circumvent detailed modeling. Under it, if the salient characteristics 
can be induced with little or no modeling (e.g., via heuristics that aren't rigorously and 
formally justified), then they provide an indirect way to improve world utility without 
recourse to detailed modeling. In fact, since one does not use detailed modeling in a 
descriptive framework, it may even be that one does not have a fully rigorous mathe- 
matical proof that the central assumption holds in a particular system for one's choice of 
salient characteristics. One may have to be content with reasonableness arguments not 
only to justify one's scheme for inducing the salient characteristics, but also to justify the 
assumption that those characteristics are correlated with large world utility in the first place.^ 
Of course, the trick in the descriptive framework is to choose salient characteristics that 
both have a beneficial relationship with world utility and that one expects to be able to 
induce with relatively little detailed modeling of the system's dynamics. 

4.2 Nomenclature 

There exist many ways one might try to design a descriptive framework. In this subsec- 
tion we present nomenclature needed for a (very) cursory overview of one of them. (See 
[283] for a more detailed exposition, including formal proofs.) 

This overview concentrates on the four salient characteristics of intelligence, learn- 
ability, factoredness, and the wonderful life utility, all defined below. Intelligence is a 
quantification of how well an RL algorithm performs. We want to do whatever we can 
to help those algorithms achieve high values of their utility functions. Learnability is 
a characteristic of a utility function that one would expect to be well-correlated with 
how well an RL algorithm can learn to optimize it. A utility function is factored if, 
whenever its value increases, the overall system benefits. Finally, wonderful life utility 
is an example of a utility function that is both learnable and factored. 
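
The informal notion of factoredness given above can be checked by brute force on a small example. The sketch below is only one reading of that informal statement (no unilateral change by an agent may raise its own utility while lowering the world utility); the formal definition comes later in the paper. The crowding game, the two candidate utilities and the function is_factored are all hypothetical.

```python
from itertools import product


def world_utility(actions):
    """Toy world utility: each agent picks resource 0 or 1, and crowding is penalised."""
    n1 = sum(actions)
    n0 = len(actions) - n1
    return -(n0**2 + n1**2)


def team_utility(i, actions):
    """Agent i is simply handed the world utility as its own utility."""
    return world_utility(actions)


def greedy_utility(i, actions):
    """Agent i prefers resource 1 regardless of crowding (a deliberately poor choice)."""
    return actions[i]


def is_factored(utility, n_agents=4):
    """True if no unilateral change raises the agent's utility while lowering world utility."""
    for actions in product((0, 1), repeat=n_agents):
        for i in range(n_agents):
            alt = list(actions)
            alt[i] = 1 - alt[i]
            alt = tuple(alt)
            d_agent = utility(i, alt) - utility(i, actions)
            d_world = world_utility(alt) - world_utility(actions)
            if d_agent > 0 and d_world < 0:
                return False
    return True


print(is_factored(team_utility), is_factored(greedy_utility))
```

The team utility passes trivially, since it moves in lockstep with the world utility; the greedy utility fails as soon as an agent can gain by joining an already crowded resource.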

After the preliminary definitions below, this section formalizes these four salient char- 
acteristics, derives several theorems relating them, and illustrates in some computer ex- 
periments how those theorems can be used to help the system achieve high world utility. 



^ Despite only being implicit, such reasonableness arguments are all that underpin fields like non- 
Bayesian learning. See [79, li^, 80]. 
4.2.1 Preliminary Definitions 

1) We refer to an RL algorithm by which an individual component of the COIN modifies 
its behavior as a microlearning algorithm. We refer to the initial construction of the 
COIN, potentially based upon salient characteristics, as the COIN initialization. We 
use the phrase macrolearning to refer to externally imposed run-time modifications to 
the COIN which are based on statistical inference concerning salient characteristics of 
the running COIN. 

2) For convenience, we take time, t, to be discrete and confined to the integers, Z. 
When referring to COIN initialization, we implicitly have a lower bound on t, which 
without loss of generality we take to be less than or equal to 0. 

3) All variables that have any effect on the COIN are identified as components of 
Euclidean-vector-valued states of various discrete nodes. As an important example, if 
our COIN consists in part of a computational "agent" running a microlearning algorithm, 
the precise configuration of that agent at any time t, including all variables in its learning 
algorithm, all actions directly visible to the outside world, all internal parameters, all 
values observed by its probes of the surrounding environment, etc., all constitute the 
state vector of a node representing that agent. We define $\zeta_{\eta,t}$ to be a vector in the 
Euclidean vector space $Z_{\eta,t}$, where the components of $\zeta_{\eta,t}$ give the state of node $\eta$ at 
time $t$. The $i$'th component of that vector is indicated by $\zeta_{\eta,t;i}$. 

Observation 3.1: In practice, many COINs will involve variables that are most nat- 
urally viewed as discrete and symbolic. In such cases, we must exercise some care in 
how we choose to represent those variables as components of Euclidean vectors. There 
is nothing new in this; the same issue arises in modern work on applying neural nets to 
inherently symbolic problems. In our COIN framework, we will usually employ the same 
resolution of this issue employed in neural nets, namely representing the possible values 
of the discrete variable with a unary representation in a Euclidean space. Just as with 
neural nets, values of such vectors that do not lie on the vertices of the unit hypercube 
are not meaningful, strictly speaking. Fortunately though, just as with neural nets, there 
is almost always a most natural way to extend the definitions of any function of interest 
(like world utility) so that it is well-defined even for vectors not lying on those vertices. 
This allows us to meaningfully define partial derivatives of such functions with respect 
to the components of $\zeta$, partial derivatives that we will evaluate at the corners of the 
unit hypercube. 
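
A minimal sketch of the unary representation discussed in Observation 3.1 follows. The payoff table, the choice of a linear extension off the vertices, and the finite-difference helper are all assumptions made for illustration; the paper does not prescribe a particular extension.

```python
def one_hot(index, n_values):
    """Unary representation of a symbolic value: a vertex of the unit hypercube."""
    return [1.0 if i == index else 0.0 for i in range(n_values)]


PAYOFF = [3.0, 1.0, 4.0]  # hypothetical utility of each of three symbolic actions


def utility(x):
    """Linear extension of the lookup table; it agrees with PAYOFF at the vertices."""
    return sum(p * xi for p, xi in zip(PAYOFF, x))


def partial(func, x, i, eps=1e-6):
    """Central finite-difference estimate of the partial derivative of func in x[i]."""
    up = list(x)
    up[i] += eps
    down = list(x)
    down[i] -= eps
    return (func(up) - func(down)) / (2 * eps)


vertex = one_hot(2, 3)
print(utility(vertex), [partial(utility, vertex, i) for i in range(3)])
```

Points such as up and down lie off the hypercube vertices and have no symbolic meaning, but the extended function is still well-defined there, which is what makes the derivative at the vertex computable.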

4) For notational convenience, we define $\zeta_{,t} \in Z_{,t}$ to be the vector of the states of all 
nodes at time $t$; $\zeta_{\hat{}\eta,t} \in Z_{\hat{}\eta,t}$ to be the vector of the states of all nodes other than $\eta$ at 
time $t$; and $\zeta = \zeta_{,} \in Z = Z_{,}$ to be the entire vector of the states of all nodes at all times. 
$Z$ is infinite-dimensional in general, and usually assumed to be a Hilbert space. We will 
often assume that all spaces $Z_{,t}$ over all times $t$ are isomorphic to a space $Z^{0}$, i.e., $Z$ is 
a Cartesian product of copies of $Z^{0}$. 

Also for notational convenience, we define gradients using $\partial$-shorthand. So for exam- 
ple, $\partial_{\zeta_{,t}} F(\zeta)$ is the vector of the partial derivatives of $F(\zeta)$ with respect to the compo- 
nents of $\zeta_{,t}$. Also, we will sometimes treat the symbol "t" specially, as delineating a range 
of components of $\zeta$. So for example an expression like "$\zeta_{,t<t'}$" refers to all components 
$\zeta_{,t}$ with $t < t'$. 

5) To avoid confusion with the other uses of the comma operator, we will often use 
$x \bullet y$ rather than $(x, y)$ to indicate the vector formed by concatenating the two ordered 
sets of vector components $x$ and $y$. For example, $\zeta_{\eta,t<0} \bullet \zeta_{\eta',t>0}$ refers to the vector 
formed by concatenating those components of the worldline $\zeta$ involving node $\eta$ for times 
less than 0 with those components involving node $\eta'$ that have times greater than 0. 

6) We take the universe in which our COIN operates to be completely deterministic. 
This is certainly the case for any COIN that operates in a digital system, even a system 
that emulates analog and/or stochastic processes {e.g., with a pseudo-random number 
generator). More generally, this determinism reflects the fact that since the real world 
obeys (deterministic) physics, any real-world system, be it a COIN or something else, 
is, ultimately, embedded in a deterministic system.^ 

The perspective to be kept in mind here is that of nonlinear time-series analysis. A 
physical time series typically reflects a few degrees of freedom that are projected out of 
the underlying space in which the full system is deterministically evolving, an underlying 
space that is actually extremely high-dimensional. This projection typically results in 
an illusion of stochasticity in the time series. 
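
The projection argument above can be illustrated with a small sketch. It assumes a 32-bit linear congruential generator as a stand-in for the hidden deterministic state and exposes only its top bit as the "observed" time series; the function names and the seed are hypothetical.

```python
def environment_step(state):
    """Deterministic update of a hidden, high-resolution state (a 32-bit LCG)."""
    return (1664525 * state + 1013904223) % 2**32


def observed_series(seed, length):
    """Project each hidden state down to a single observable bit."""
    state = seed
    bits = []
    for _ in range(length):
        state = environment_step(state)
        bits.append(state >> 31)  # only the top bit is visible to the observer
    return bits


series = observed_series(seed=12345, length=10_000)
print(sum(series) / len(series))
```

The observed bits look like coin flips to anyone who cannot see the full state, yet rerunning from the same seed reproduces the series exactly.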

7) Formally, to reflect this determinism, first we bundle all variables we are not 
directly considering — but which nonetheless affect the dynamics of the system — as 
components of some catch-all environment node. So for example any "noise processes" 
and the like affecting the COIN's dynamics are taken to be inputs from a deterministic, 
very high-dimensional environment that is potentially chaotic and is never directly ob- 
served [^]. Given such an environment node, we then stipulate that for all $t, t'$ such 
that $t' > t$, $\zeta_{,t}$ sets $\zeta_{,t'}$ uniquely. 

Observation 7.1: When nodes are "computational devices", often we must be care- 
ful to specify the physical extent of those devices. Such a node may just be the associated 
CPU, or it may be that CPU together with the main RAM, or it may include an external 
storage device. Almost always, the border of the device $\eta$ will end before any external 
system that $\eta$ is "observing" begins. This means that since at time $t$ the device $\eta$ only knows the 
value of $\zeta_{\eta,t}$, its "observational knowledge" of that external system is indirect. That 
knowledge reflects a coupling between $\zeta_{\eta,t}$ and $\zeta_{\hat{}\eta,t}$, a coupling that is induced by the 
dynamical evolution of the system from preceding moments up to the time $t$. If the 
dynamics does not force such a coupling, then $\eta$ has no observational knowledge of the 
outside world. 

^ This determinism holds even for systems with an explicitly quantum mechanical character. Quantum 
mechanical systems evolve according to Schrödinger's equation, which is purely deterministic; as is now 
well-accepted, the "stochastic" aspect of quantum mechanics can be interpreted as an epiphenomenon of 
Schrödinger's equation that arises when the Hamiltonian has an "observational" or "entangling" coupling 
between some of its variables [p4, 297, 106], a coupling that does not obviate the underlying determinism. 

8) We express the dynamics of our system by writing $\zeta_{,t'>t} = C(\zeta_{,t})$. (In this paper 
there will be no need to be more precise and specify the precise dependency of $C(\cdot)$ on $t$ 
and/or $t'$.) We define $\{C\}$ to be a set of constraint equations enforcing that dynamics, 
and also, more generally, fixing the entire manifold $C$ of vectors $\zeta \in Z$ that we consider 
to be 'allowed'. So $C$ is a subset of the set of all $\zeta \in Z$ that are consistent with 
the deterministic laws governing the COIN, i.e., that obey $\zeta_{,t'>t} = C(\zeta_{,t}) \;\forall\, t, t'$. We 
generalize this notation in the obvious way, so that (for example) $C_{,t>t_0}$ is the manifold 
consisting of all vectors $\zeta_{,t>t_0} \in Z_{,t>t_0}$ that are projections of a vector in $C$. 

Observation 8.1: Note that $C_{,t>t_0}$ is parameterized by $\zeta_{,t_0}$, due to determinism. Note 
also that whereas $C(\cdot)$ is defined for any argument of the form $\zeta_{,t} \in Z_{,t}$ for some $t$ (i.e., we 
can evolve any point forward in time), in general not all $\zeta_{,t} \in Z_{,t}$ lie in $C_{,t}$. In particular, 
there may be extra restrictions constraining the possible states of the system beyond 
those arising from its need to obey the relevant dynamical laws of physics. Finally, 
whenever trying to express a COIN in terms of the framework presented here, it is a 
good rule to try to write out the constraint equations explicitly to check that what one 
has identified as the space $Z_{,t}$ contains all quantities needed to uniquely fix the future 
state of the system. 

Observation 8.2: We do not want to have Z be the phase space of every particle 
in the system. We will instead usually have Z consist of variables that, although still 
evolving deterministically, exist at a larger scale of granularity than that of individual 
particles {e.g., thermodynamic variables in the thermodynamic limit). However we will 
often be concerned with physical systems obeying entropy-driven dynamic processes that 
are contractive at this high level of granularity. Examples are any of the many-to-one 
mappings that can occur in digital computers, and, at a finer level of granularity, any 
of the error-correcting processes in the electronics of such a computer that allow it to 
operate in a digital fashion. Accordingly, although the dynamics of our system will 
always be deterministic, it need not be invertible. 
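
The distinction drawn in Observation 8.2 between deterministic and invertible dynamics can be made concrete with a toy update rule. The integer-halving map below is a hypothetical stand-in for the many-to-one mappings found in digital computers, not a model of any system in the paper.

```python
def step(state):
    """A contractive, many-to-one update: two distinct states can share a successor."""
    return state // 2


def worldline(initial, length):
    """The whole history is fixed once the initial state is fixed."""
    states = [initial]
    for _ in range(length - 1):
        states.append(step(states[-1]))
    return states


print(worldline(6, 4), worldline(7, 4))
```

Starting from 6 or from 7 gives the same states from the second time step onward, so the future is uniquely determined by the present while the past cannot be recovered from it.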

Observation 8.3: Intuitively, in our mathematics, all behavior across time is pre-fixed. 
The COIN is a single fixed worldline through $Z$, with no "unfolding of the future" 
as the dice underlying a stochastic dynamics get cast. This is consistent with the fact 
that we want the formalism to be purely descriptive, relating different properties of any 
single, fixed COIN's history. We will often informally refer to "changing a node's state 
at a particular time", or to a microlearner's "choosing from a set of options", and the 
like. Formally, in all such phrases we are really comparing different worldlines, with the 
indicated modification distinguishing those worldlines. 

Observation 8.4: Since the dynamics of any real-world COIN is deterministic, so is the 
dynamics of any component of the COIN, and in particular so is any learning algorithm 
running in the COIN, ultimately. However that does not mean that those deterministic 
components of the COIN are not allowed to be "based on" , or "motivated by" stochastic 
concepts. The motivation behind the algorithms run by the components of the COIN 
does not change their underlying nature. Indeed, in our experiments below, we explicitly 
have the reinforcement learning algorithms that are trying to maximize private utility 
operate in a (pseudo-) probabilistic fashion, with pseudo-random number generators and 
the like. 

More generally, the deterministic nature of our framework does not preclude our su- 
perimposing probabilistic elements on top of that framework, and thereby generating 
a stochastic extension of our framework. Exactly as in statistical physics, a stochas- 
tic nature can be superimposed on our space of deterministic worldlines, potentially by 
adopting a degree of belief perspective on "what probability means" [282, 13S]. Indeed, 
the macrolearning algorithms we investigate below implicitly involve such a superimpos- 
ing; they implicitly assume a probabilistic coupling between the (statistical estimate of 
the) correlation coefficient connecting the states of a pair of nodes and whether those 
nodes are in one another's "effect set". 

Similarly, while it does not require salient characteristics that involve probability distribu- 
tions, the descriptive framework does not preclude such characteristics either. As an 
example, the "intelligence" of an agent's particular action, formally defined below, mea- 
sures the fraction of alternative actions an agent could have taken that would have 
resulted in a lower utility value. To define such a fraction requires a measure across the 
space of such alternative actions, even if only implicitly. Accordingly, intelligence can be 
viewed as involving a probability distribution across the space of potential actions. 
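
The informal description of intelligence given above can be sketched directly. This is only an illustration of the sentence in the text, not the formal definition that appears later in the paper; the function name, the default uniform measure, and the choice to count only strictly lower utilities are assumptions.

```python
def intelligence(utility, chosen, alternatives, measure=None):
    """Fraction (under `measure`) of alternative actions whose utility is lower than `chosen`'s."""
    if measure is None:
        measure = {a: 1.0 for a in alternatives}  # uniform measure by default
    total = sum(measure[a] for a in alternatives)
    worse = sum(measure[a] for a in alternatives if utility(a) < utility(chosen))
    return worse / total


def toy_utility(action):
    """Hypothetical utility over integer actions, peaked at action 3."""
    return -(action - 3) ** 2


actions = list(range(7))
print(intelligence(toy_utility, 3, actions), intelligence(toy_utility, 0, actions))
```

The measure argument makes the implicit probability distribution explicit: changing it changes the intelligence value of the same action, without any change to the utility function itself.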

In this paper though, we concentrate on the mathematics that obtains before such 
probabilistic concerns are superimposed. Whereas the deterministic analysis presented 
here is related to game-theoretic structures like Nash equilibria, a full-blown stochastic 
extension would in some ways be more related to structures like correlated equilibria. 

9) Formally, there is a lot of freedom in setting the boundary between what we call 
"the COIN", whose dynamics is determined by C, and what we call "macrolearning", 
which constitutes perturbations to the COIN instigated from "outside the COIN", and 
which therefore is not reflected in C. As an example, in much of this paper, we have 
clearly specified microlearners which are provided fixed private utility functions that they 
are trying to maximize. In such cases usually we will implicitly take $C$ to be the dynamics 
of the system, microlearning and all, for fixed private utilities that are specified in $\zeta$. For 
example, $\zeta$ could contain, for each microlearner, the bits in an associated computer 
specifying the subroutine that that microlearner can call to evaluate what its private 
utility would be for some full worldline $\zeta$. 

Macrolearning overrides C, and in this situation it refers (for example) to any statis- 
tical inference process that modifies the private utilities at run-time to try to induce the 
desired salient characteristics. Concretely, this would involve modifications to the bits 
$\{b_i\}$ specifying each microlearner $i$'s private utility, modifications that are not accounted 
for in $C$, and that are potentially based on variables that are not reflected in $Z$. Since 
$C$ does not reflect such macrolearning, when trying to ascertain $C$ based on empirical 
observation (as for example when determining how best to modify the private utilities), 
we have to take care to distinguish which part of the system's observed dynamics is due 
to C and which part instead reflects externally imposed modifications to the private 
utilities. 

More generally though, other boundaries between the COIN and macrolearning-based 
perturbations to it are possible, reflecting other definitions of $Z$, and other interpretations 
of the elements of each $\zeta \in Z$. For example, say that under the perspective presented 
in the previous paragraph, the private utility is a function of some components $s$ of $\zeta$, 
components that do not include the $\{b_i\}$. Now modify this perspective so that in addition 
to the dynamics of other bits, $C$ also encapsulates the dynamics of the bits $\{b_i\}$. Having 
done this, we could still view each private utility as being fixed, but rather than take the 
bits $\{b_i\}$ as "encoding" the subroutine that specifies the private utility of microlearner $i$, 
we would treat them as "parameters" specifying the functional dependence of the (fixed) 
private utility on the components of $\zeta$. In other words, formally, they constitute an extra 
set of arguments to $i$'s private utility, in addition to the arguments $s$. Alternatively, we 
could simply say that in this situation our private utilities are time-indexed, with $i$'s 
private utility at time $t$ determined by $\{b_{i,t}\}$, which in turn is determined by evolution 
under $C$. Under either interpretation of private utility, any modification under $C$ to the 
bits specifying $i$'s utility-evaluation subroutine constitutes dynamical laws by which the 
parameters of $i$'s microlearner evolve in time. In this case, macrolearning would refer 
to some further removed process that modifies the evolution of the system in a way not 
encapsulated in $C$. 

For such alternative definitions of C/Z, we have a different boundary between the 
COIN and macrolearning, and we must scrutinize different aspects of the COIN's dynam- 
ics to infer C. Whatever the boundary, the mathematics of the descriptive framework, 
including the mathematics concerning the salient characteristics, is restricted to a system 
evolving according to C, and explicitly does not account for macrolearning. This is why 
the strategy of trying to improve world utility by using macrolearning to try to induce 
salient characteristics is almost always ultimately based on an assumption rather than a 
proof. 

10) We are provided with some von Neumann world utility $G : Z \to \mathbb{R}$ that ranks 
the various conceivable worldlines of the COIN. Note that since the environment node 
is never directly observed, we implicitly assume that the world utility is not directly 
(!) a function of its state. Our mathematics will not involve $G$ alone, but rather the 
relationship between $G$ and various sets of personal utilities $g_{\eta,t} : Z \to \mathbb{R}$. 

Intuitively, as discussed below, for many purposes such personal utilities are equivalent 
to arbitrary "virtual" versions of the private utilities mentioned above. In particular, 
it is only private utilities that will occur within any microlearning computer algorithms 
that may be running in the COIN as manifested in C. Personal utilities are external 
mathematical constructions that the COIN framework employs to analyze the behavior 
of the system. They can be involved in learning processes, but only as tools that are 
employed outside of the COIN's evolution under C, i.e., only in macrolearning. (For 
example, analysis of them can be used to modify the private utilities.) 

Observation 10.1: These utility definitions are very broad. In particular, they do not 
require casting of the utilities as discounted sums. Note also that our world utility is 
not indexed by t. Again reflecting the descriptive, worldline character of the formalism, 
we simply assign a single value to an entire worldline of the system, implicitly assuming 
that one can always say which of two candidate worldlines is preferable. So given some 
"present time" $t_0$, issues like which of two "potential futures" $\zeta_{,t>t_0}$, $\zeta'_{,t>t_0}$ is preferable 
are resolved by evaluating the relevant utility at two associated points $\zeta$ and $\zeta'$, where 
the $t > t_0$ components of those points are the futures indicated, and the two points share 
the same (usually implicit) $t < t_0$ "past" components. 

This time-independence of G automatically avoids formal problems that can occur 
with general (i.e., not necessarily discounted sum) time-indexed utilities, problems like 
having what's optimal at one moment in time conflict with what's optimal at other 
moments in time.^ For personal utilities such formal problems are often irrelevant 
however. Before we begin our work, we as COIN designers must be able to rank all 
possible worldlines of the system at hand, to have a well-defined design task. That is 
why world utility cannot be time-indexed. However if a particular microlearner's goal 
keeps changing in an inconsistent way, that simply means that that microlearner will 
grow "confused". From our perspective as COIN designers, there is nothing a priori 
unacceptable about such confusion. It may even result in better performance of the 
system as a whole, in which case we would actually want to induce it. Nonetheless, for 
simplicity, in most of this paper we will have all $g_{\eta,t}$ be independent of $t$, just like world 
utility. 

World utility is defined as that function that we are ultimately interested in optimiz- 
ing. In conventional RL it is a discounted sum, with the sum starting at time t. In 
other words, conventional RL has a time-indexed world utility. It might seem that in 
this at least, conventional RL considers a case that has more generality than that of 
the COIN framework presented here. (It obviously has less generality in that its world 
utility is restricted to be a discounted sum.) In fact though, the apparent time-indexing 
of conventional RL is illusory, and the time-dependent discounted sum world utility of 
conventional RL is actually a special case of the non-time-indexed world utility of our 
COIN framework. To see this formally, consider any (time-independent) world utility 
$G(\zeta)$ that equals $\sum_{t=0}^{\infty} \gamma^t r(\zeta_{,t})$ for some function $r(\cdot)$ and some positive constant $\gamma$ with 
magnitude less than 1. Then for any $t' > 0$ and any $\zeta'$ and $\zeta''$ where $\zeta'_{,t<t'} = \zeta''_{,t<t'}$, 

$$\mathrm{sgn}[G(\zeta') - G(\zeta'')] = \mathrm{sgn}\Big[\sum_{t=0}^{\infty} \gamma^t r(\zeta'_{,t}) - \sum_{t=0}^{\infty} \gamma^t r(\zeta''_{,t})\Big].$$ 

Conventional RL merely expresses this in terms of time-dependent utilities 
$U_{t'}(\zeta_{,t \ge t'}) = \sum_{t=t'}^{\infty} \gamma^{t-t'} r(\zeta_{,t})$ by writing 

$$\mathrm{sgn}[G(\zeta') - G(\zeta'')] = \mathrm{sgn}[U_{t'}(\zeta'_{,t \ge t'}) - U_{t'}(\zeta''_{,t \ge t'})] \quad \forall\, t'.$$ 

Since utility functions are, by definition, only unique up to the relative orderings they impose 
on potential values of their arguments, we see that conventional RL's use of a time-dependent 
discounted sum world utility $U_{t'}$ is identical to use of a particular time-independent world 
utility in our COIN framework. 

^ Such conflicts can be especially troublesome when they interfere with our defining what we mean 
by an "optimal" set of actions by the nodes at a particular time t. The effects of the actions by the 
nodes, and therefore whether those actions are "optimal" or not, depend on the future actions of the 
nodes. However if they too are to be "optimal", according to their world-utility, those future actions 
will depend on their futures. So we have a potentially infinite regress of differing stipulations of what 
"optimal" actions at time t entails. 
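
The ordering argument above can be checked numerically on a toy case. The sketch assumes a finite horizon (the infinite sums are truncated), a hypothetical reward function, and two short worldlines that share their past before t'; G is the time-independent discounted sum from t = 0 and U is the conventional time-indexed sum restarted at t'.

```python
GAMMA = 0.9  # discount factor, positive with magnitude less than 1


def reward(state):
    """Hypothetical per-time-step reward r(.) on scalar states."""
    return float(state)


def G(worldline):
    """Time-independent world utility: discounted sum starting at t = 0."""
    return sum(GAMMA**t * reward(s) for t, s in enumerate(worldline))


def U(t_prime, worldline):
    """Conventional time-indexed utility: discounted sum restarted at t'."""
    return sum(
        GAMMA ** (t - t_prime) * reward(s)
        for t, s in enumerate(worldline)
        if t >= t_prime
    )


def sign(x):
    return (x > 0) - (x < 0)


past = [1, 0, 2]          # shared components with t < t'
a = past + [3, 0, 1]      # one candidate future
b = past + [0, 4, 1]      # another candidate future
t_prime = len(past)
print(sign(G(a) - G(b)), sign(U(t_prime, a) - U(t_prime, b)))
```

Because the shared past cancels and the remaining terms differ only by the positive factor GAMMA ** t_prime, both utilities rank the two futures identically.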

11) As mentioned above, there may be variables in each node's state which, under 
one particular interpretation, represent the "utility functions" that the associated mi- 
crolearner's computer program is trying to extremize. When there are such components 
of $\zeta$, we refer to the utilities they represent as private utilities. However even when 
there are private utilities, formally we allow the personal utilities to differ from them. 
The personal utility functions $\{g_\eta\}$ do not exist "inside the COIN"; they are not specified 
by components of $\zeta$. This separating of the private utilities from the $\{g_\eta\}$ will allow us 
to avoid the teleological problem that one may not always be able to explicitly identify 
"the" private utility function reflected in $\zeta$ such that a particular computational device 
can be said to be a microlearner "trying to increase the value of its private utility". To 
the degree that we can couch the theorems purely in terms of personal rather than pri- 
vate utilities, we will have successfully adopted a purely behaviorist approach, without 
any need to interpret what a computational device is "trying to do" . 

Despite this formal distinction though, often we will implicitly have in mind deploying 
the personal utilities onto the microlearners as their private utilities, in which case the 
terms can usually be used interchangeably. The context should make it clear when this 
is the case. 

4.2.2 Intelligence 

We will need to quantify how well the entire system performs in terms of G. To do this
requires a measure of the performance of an arbitrary worldline $\zeta$, for an arbitrary utility
function, under arbitrary dynamic laws C. Formally, such a measure is a mapping from
three arguments to $\mathbb{R}$.

Such a measure will also allow us to quantify how well each microlearner performs in 
purely behavioral terms, in terms of its personal utility. (In our behaviorist approach, we 
do not try to make specious distinctions between whether a microlearner's performance is 
due to its level of "innate sophistication" , or rather due to dumb luck — all that matters 
is the quality of its behavior as reflected in its utility value for the system's worldline.) 
This behaviorism in turn will allow us to avoid having private utilities explicitly arise in
our theorems (although they still arise frequently in pedagogical discussion). Even when
private utilities exist, there will be no formal need to explicitly identify some components
of $\zeta$ as such utilities. Assuming a node's microlearner is competent, the fact that it is
trying to optimize some particular private utility U will be manifested in our performance
measure's having a large value at $\zeta$ for C for that utility U.

The problem of how to formally define such a performance measure is essentially 
equivalent to the problem of how to quantify bounded rationality in game theory. Some of 
the relevant work in game theory, for example that involving 'trembling hand equilibria'
or '$\epsilon$ equilibria' [^7|, is concerned with refinements or modifications of Nash equilibria (see
also [161]). Rather than a behaviorist approach, such work adopts a strongly teleological
perspective on rationality. In general, such work is only applicable to those situations 
where the rationality is bounded due to the precise causal mechanisms investigated in 
that work. Most of the other game-theoretic work first models (!) the microlearner
as some extremely simple computational device (e.g., a deterministic finite automaton
(DFA)). One then assumes that the microlearner performs perfectly for that device, so
that one can measure that learner's performance in terms of some computational capacity
measure of the model (e.g., for a DFA, the number of states of that DFA) ]10^, |02|, .0
However, if taken as renditions of real- world computer-based microlearners — never mind 
human microlearners — the models in this approach are often extremely abstracted, with 
many important characteristics of the real learners absent or distorted. In addition, there 
is little reason to believe that any results arising from this approach would not be highly 
dependent on the model choice and on the associated representation of computational 
capacity. Yet another disadvantage is that this approach concentrates on perfect, fully 
rational behavior of the microlearners, within their computational restrictions. 

We would prefer a less model-dependent approach, especially given our wish that the
performance measure be based solely on the utility function at hand, $\zeta$, and C. Now we
don't want our performance measure to be a "raw" utility value like $g_\eta(\zeta)$, since that is
not invariant with respect to monotonic transformations of $g_\eta$. Similarly, we don't want
to penalize the microlearner for not achieving a certain utility value if that value was
impossible to achieve not due to the microlearner's shortcomings, but rather due to C
and the actions of other nodes. A natural way to address these concerns is to generalize
the game-theoretic concept of "best-response strategy" and consider the problem of how
well $\eta$ performs given the actions of the other nodes. Such a measure would compare the
utility ultimately induced by each of the possible states of $\eta$ at some particular time,
which without loss of generality we can take to be 0, to that induced by the actual state
$\zeta_{\eta,0}$. In other words, we would compare the utility of the actual worldline $\zeta$ to those
of a set of alternative worldlines $\zeta'$, where $\zeta_{\hat\eta,0} = \zeta'_{\hat\eta,0}$, and use those comparisons to
quantify the quality of $\eta$'s performance.

Now we are only concerned with comparing the effects of replacing $\zeta$ with $\zeta'$ on future
contributions to the utility. But if we allow arbitrary $\zeta'_{,t<0}$, then in and of themselves
the difference between those past components of $\zeta'$ and those of $\zeta$ can modify the value
of the utility, regardless of the effects of any difference in the future components. Our
presumption is that for many COINs of interest we can avoid this conundrum by restricting
attention to those where $\zeta'_{,t<0}$ differs from $\zeta_{,t<0}$ only in the internal parameters of
$\eta$'s microlearner, differences that only at times $t \ge 0$ manifest themselves in a form the
utility is concerned with. (In game-theoretic terms, such "internal parameters" encode
full extensive-form strategies, and we only consider changes to the vertices at or below
the t = 0 level in the tree of an extensive-form strategy.)

^''Some of the more popular model-based scenarios for investigating bounded rationality, like 'fictitious
play' (see the game theory section above), do not even stipulate one particular way to quantify that
rationality.

Although this solution to our conundrum is fine when we can apply it, we don't
want to restrict the formalism so that it can only concern systems having computational
algorithms which involve a clearly pre-specified set of extensive strategy "internal
parameters" and the like. So instead, we formalize our presumption behaviorally, even for
computational algorithms that do not have explicit extensive strategy internal parameters.
Since changing the internal parameters doesn't affect the $t < 0$ components of
$\zeta$ that the utility is concerned with, and since we are only concerned with changes to
$\zeta$ that affect the utility, we simply elect to not change the $t < 0$ values of the internal
parameters of $\zeta$ at all. In other words, we leave $\zeta_{,t<0}$ unchanged. The advantage of
this stipulation is that we can apply it just as easily whether $\eta$ does or doesn't have any
"internal parameters" in the first place.

So in quantifying the performance of $\eta$ for behavior given by $\zeta$ we compare $\zeta$ to a
set of $\zeta'$, a set restricted to those $\zeta'$ sharing $\zeta$'s past: $\zeta'_{,t<0} = \zeta_{,t<0}$, $\zeta'_{\hat\eta,0} = \zeta_{\hat\eta,0}$, and
$\zeta'_{,t \ge 0} \in C_{,t \ge 0}$. Since $\zeta'_{\eta,0}$ is free to vary (reflecting the possible changes in the state of $\eta$ at
time 0) while $\zeta'_{,t<0}$ is not, $\zeta' \notin C$ in general. We may even wish to allow $\zeta'_{,t \ge 0} \notin C_{,t \ge 0}$ in
certain circumstances. (Recall that C may reflect other restrictions imposed on allowed
worldlines besides adherence to the underlying dynamical laws, so simply obeying those
laws does not suffice to ensure that a worldline lies on C.) In general though, our
presumption is that as far as utility values are concerned, considering these dynamically
impossible $\zeta'$ is equivalent to considering a more restricted set of $\zeta'$ with "modified
internal parameters", all of which are $\in C$.

We now present a formalization of this performance measure. Given C and a measure
$d\mu(\zeta_{,0})$ demarcating what points $\zeta_{,0}$ we are interested in, we define the (t = 0)
intelligence for node $\eta$ of a point $\zeta$ with respect to a utility U as follows:

$\epsilon_{\eta,U}(\zeta) \equiv \int d\mu(\zeta'_{,0}) \, \Theta[U(\zeta) - U(\zeta_{,t<0} \bullet C(\zeta'_{,0}))] \cdot \delta(\zeta'_{\hat\eta,0} - \zeta_{\hat\eta,0})$    (1)

where $\Theta(.)$ is the Heaviside theta function which equals 0 if its argument is below 0 and
equals 1 otherwise, $\delta(.)$ is the Dirac delta function, and we assume that $\int d\mu(\zeta'_{,0}) = 1$.

Intuitively, $\epsilon_{\eta,U}(\zeta)$ measures the fraction of alternative states of $\eta$ which, if $\eta$ had
been in those states at time 0, would either degrade or not improve $\eta$'s performance (as
measured by U). Sometimes in practice we will only want to consider changes in those
components of $\zeta_{\eta,0}$ that we consider as "free to vary", which means in particular that
those changes are consistent with C and the state of the external world, $\zeta_{\hat\eta,0}$. (This
consistency ensures that $\eta$'s observational information concerning the external world is
correct; see Observation 7.1 above.) Such a restriction means that even though $\zeta'_{,0}$ may
not be consistent with C and $\zeta_{,t<0}$, by itself it is still consistent with C; in quantifying the
quality of a particular $\zeta_{\eta,0}$ we don't compare our point to other $\zeta'_{\eta,0}$ that are physically
impossible, no matter what the past is. Any such restrictions on what changes we are
considering are reflected implicitly in intelligence, in the measure $d\mu$.
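
As an informal illustration of the definition in Eq. 1, the following Python sketch computes a
discrete analogue of intelligence: the weighted fraction of a node's alternative time-0 states that
would not have done better than the state it actually adopted. The quadratic utility, the five
candidate states, and the uniform weights standing in for the measure are all assumptions made for
the example; the states of the other nodes are treated as fixed and folded into the utility.

```python
def intelligence(utility, actual, alternatives, weights=None):
    """Discrete stand-in for Eq. 1.

    Returns the total weight of alternative states whose utility does not
    exceed that of the actual state. The weights play the role of the
    (normalized) measure; they default to uniform.
    """
    if weights is None:
        weights = [1.0 / len(alternatives)] * len(alternatives)
    u_actual = utility(actual)
    # Heaviside step: an alternative counts when U(actual) - U(alternative) >= 0.
    return sum(w for alt, w in zip(alternatives, weights)
               if u_actual - utility(alt) >= 0)


# Hypothetical toy problem: the node picks z in {0,...,4}; the contribution of
# all other nodes is frozen at 3, and utility peaks when the total equals 5.
OTHERS_TOTAL = 3

def utility(z):
    return -(z + OTHERS_TOTAL - 5) ** 2

alternatives = [0, 1, 2, 3, 4]

for z in alternatives:
    print(z, intelligence(utility, z, alternatives))
```

In this toy problem the best response (z = 2) has intelligence 1, and the value falls as the chosen
state is beaten by more of the alternatives.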

As an example of intelligence, consider the situation where for each player $\eta$, the
support of the measure $d\mu(\zeta'_{,0})$ extends over all possible actions that $\eta$ could take that
affect the ultimate value of its personal utility, $g_\eta$. In this situation we recover conventional
full rationality game theory involving Nash equilibria, as the analysis of scenarios
in which the intelligence of each player $\eta$ with respect to $g_\eta$ equals 1. As an alternative,
we could for each $\eta$ restrict $d\mu(\zeta'_{,0})$ to some limited "set of actions that $\eta$ actively
considers". This provides us with an "effective Nash equilibrium" at the point $\zeta$ where
each $\epsilon_{\eta,g_\eta}(\zeta)$ equals 1, in the sense that as far as it's concerned, each player $\eta$ has played a
best possible action at such a point. As yet another alternative, we could restrict each
$d\mu(\zeta'_{,0})$ to some infinitesimal neighborhood about $\zeta_{,0}$, and thereby define a "local Nash
equilibrium" by having $\epsilon_{\eta,g_\eta}(\zeta) = 1$ for each player $\eta$.
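
The correspondence between full rationality and unit intelligence can be checked on a small
example. The sketch below assumes a two-player, two-action game with Prisoner's Dilemma payoffs
(an illustrative choice, not taken from the text) and a measure that is uniform over each player's
two actions. It enumerates the joint actions at which every player's intelligence equals 1, which
are the pure-strategy Nash equilibria of the game.

```python
import itertools

ACTIONS = ("C", "D")

# Hypothetical payoffs: PAYOFF[(a0, a1)] = (payoff to player 0, payoff to player 1).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def intelligence(player, joint):
    """Fraction of the player's own actions that would not beat its actual
    action, holding the other player's action fixed."""
    own_payoff = PAYOFF[joint][player]
    not_better = 0
    for alt in ACTIONS:
        changed = list(joint)
        changed[player] = alt
        if own_payoff - PAYOFF[tuple(changed)][player] >= 0:
            not_better += 1
    return not_better / len(ACTIONS)


def is_nash(joint):
    """A joint action is a Nash equilibrium iff all intelligences equal 1."""
    return all(intelligence(p, joint) == 1.0 for p in (0, 1))


nash_points = [j for j in itertools.product(ACTIONS, repeat=2) if is_nash(j)]
print(nash_points)
```

Restricting the loop over `ACTIONS` to a subset of actions would give the "effective Nash
equilibrium" variant described above.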

In general, competent greedy pursuit of private utility U by the microlearner controlling
node $\eta$ means that the intelligence of $\eta$ for personal utility U, $\epsilon_{\eta,U}(\zeta)$, is close to 1.
Accordingly, we will often refer interchangeably to a capable microlearner's "pursuing
private utility U" and to its having high intelligence for personal utility U. Alternatively,
if the microlearner for node $\eta$ is incompetent, then it may even be that "by luck"
its intelligence for some personal utility $g_\eta$ exceeds its intelligence for the different
private utility that it's actually trying to maximize, $U_\eta$.

Say that we expect that a particular microlearner is "smart", in that it is more
likely to have high rather than low intelligence. We can model this by saying that
given a particular $\zeta_{,t<0}$, the conditional probability that $\zeta_{\eta,0} = z$ is a monotonically
increasing function of $\epsilon_{\eta,g_\eta}(\zeta_{,t<0} \bullet C(z \bullet \zeta_{\hat\eta,0}))$. Since for a given $\zeta_{,t<0}$ the intelligence
$\epsilon_{\eta,g_\eta}$ is a monotonically increasing function of $g_\eta$, this modelling assumption means that
the probability that $\zeta_{\eta,0} = z$ is a monotonically increasing function of
$g_\eta(\zeta_{,t<0} \bullet C(z \bullet \zeta_{\hat\eta,0}))$. An alternative weaker model is to only stipulate that the probability of having
a particular pair $(\zeta_{,t<0}, \zeta_{,0})$ with $\epsilon_{\eta,g_\eta}$ equal to z is a monotonically increasing function
of z. (This probability is an integral over a joint distribution, rather than a conditional
distribution, as in the original model.) In either case, the "better" the microlearner, the
more tightly peaked the associated probability distribution over intelligence values is.

^■'As an alternative to such fully rational games, one can define a bounded rational game as one
in which the intelligences equal some vector $\epsilon^*$ whose components need not all equal 1. Many of the
theorems of conventional game theory can be directly carried over to such bounded-rational games [284]
by redefining the utility functions of the players. In other words, much of conventional full rationality
game theory applies even to games with bounded rationality, under the appropriate transformation. This
result has strong implications for the legitimacy of the common criticism of modern economic theory that
its assumption of full rationality does not hold in the real world, implications that extend significantly
beyond the Sonnenschein-Mantel-Debreu equilibrium aggregate demand theorem [177].

Any two utility functions that are related by a monotonically increasing transforma- 
tion reflect the same preference ordering over the possible arguments of those functions. 
Since it is only that ordering that we are ever concerned with, we would like to remove 
this degeneracy by "normalizing" all utility functions. In other words, we would like to 
reduce any equivalence set of utility functions that are monotonic transformations of one 
another to a canonical member of that set. To see what this means in the COIN context,
fix $\zeta_{\hat\eta,0}$. Viewed as a function of $\zeta_{\eta,0}$ into $\mathbb{R}$, $\epsilon_{\eta,U}(\zeta_{\hat\eta,0} \bullet \cdot)$ is itself a utility function, one
that is a monotonically increasing function of U. (It says how well $\eta$ would have performed
for all vectors $\zeta_{\eta,0}$.) Accordingly, the integral transform taking U to $\epsilon_{\eta,U}(\zeta_{\hat\eta,0} \bullet \cdot)$ is
a (contractive, non-invertible) mapping from utilities to utilities. Applied to any member
of U's equivalence set, this mapping produces the same image utility, one
that is also in that equivalence set. It can be proven that any mapping from utilities 
to utilities that has this and certain other simple properties must be such an integral 
transform. In this, intelligence is the unique way of "normalizing" Von Neumann utility 
functions. 
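
The normalizing property can be illustrated numerically. The sketch below uses a discrete,
uniform-measure analogue of intelligence (an assumption made for the example) and checks that a
utility and a monotonically increasing transformation of it yield identical intelligence values at
every candidate state. The particular utility and transformation are hypothetical.

```python
import math


def intelligence(utility, actual, alternatives):
    """Uniform-measure, discrete analogue of intelligence: the fraction of
    alternatives that do not yield higher utility than the actual state."""
    u_actual = utility(actual)
    not_better = sum(1 for alt in alternatives if u_actual - utility(alt) >= 0)
    return not_better / len(alternatives)


def raw(z):
    # Hypothetical utility with a single peak at z = 2.
    return -(z - 2) ** 2


def transformed(z):
    # A monotonically increasing transformation of raw(z): same ordering.
    return math.exp(3 * raw(z)) + 7


alternatives = range(5)

same = all(
    intelligence(raw, z, alternatives) == intelligence(transformed, z, alternatives)
    for z in alternatives
)
print(same)
```

Only the ordering of utility values enters the comparison inside `intelligence`, which is why
the two utilities are indistinguishable here.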

For those conversant with game theory, it is worth noting some of the interesting 
aspects that ensue from this normalizing nature of intelligences. At any point $\zeta$ that is a
Nash equilibrium in the set of personal utilities $\{g_\eta\}$, all intelligences $\epsilon_{\eta,g_\eta}(\zeta)$ must equal
1. Since that is the maximal value any intelligence can take on, a Nash equilibrium in
the $\{g_\eta\}$ is a Pareto optimal point in the associated intelligences (for the simple reason
that no deviation from such a $\zeta$ can raise any of the intelligences). Conversely, if there
exists at least one Nash equilibrium in the $\{g_\eta\}$, then there is not a Pareto optimal point
in the $\{\epsilon_{\eta,g_\eta}(\zeta)\}$ that is not a Nash equilibrium.

Now restrict attention to systems with only a single instant of time, i.e., single-stage 
games. Also have each of the (real-valued) components of each $\zeta_\eta$ be a mixing component
of an associated one of $\eta$'s potential strategies for some underlying finite game. Then
have $g_\eta(\zeta)$ be the associated expected payoff to $\eta$. (So the payoff to $\eta$ of the underlying
pure strategies is given by the values of $g_\eta(\zeta)$ when $\zeta_\eta$ is a unit vector in the space of
$\eta$'s possible states.) Then we know that there must exist at least one Nash equilibrium in
the $\{g_\eta\}$. Accordingly, in this situation the set of Nash equilibria in the $\{g_\eta\}$ is identical
to the set of points that are Pareto optimal in the associated intelligences. (See Eq. 5
in the discussion of factored systems below.) 

4.2.3 Learnability 

Intelligence can be a difficult quantity to work with, unfortunately. As an example, fix
$\eta$, and consider any (small region centered about some) $\zeta$, along with some utility U,
where $\zeta$ is not a local maximum of U. Then by increasing the values U takes on in that
small region we will increase the intelligence $\epsilon_{\eta,U}(\zeta)$. However in doing this we will also
necessarily decrease the intelligence at points outside that region. So intelligence has
a non-local character, a character that prevents us from directly modifying it to ensure
that it is simultaneously high for any and all $\zeta$.

A second, more general problem is that without specifying the details of a mi- 
crolearner, it can be extremely difficult to predict which of two private utilities the 
microlearner will be better able to learn. Indeed, even with the details, making that 
prediction can be nearly impossible. So it can be extremely difficult to determine what 
private utility intelligence values will accrue to various choices of those private utilities. 
In other words, macrolearning that involves modifying the private utilities to try to 
directly increase intelligence with respect to those utilities can be quite difficult. 

Fortunately, we can circumvent many of these difficulties by using a proxy for (private 
utility) intelligence. Although we expect its value usually to be correlated with that of 
intelligence in practice, this proxy does not share intelligence's non-local nature. In ad- 
dition, the proxy does not depend heavily on the details of the microlearning algorithms 
used, i.e., it is fairly independent of those aspects of C. Intuitively, this proxy can be 
viewed as a "salient characteristic" for intelligence. 

We motivate this proxy by considering having $g_\eta = G$ for all $\eta$. If we try to actually
use these $\{g_\eta\}$ as the microlearners' private utilities, particularly if the COIN is large, we
will invariably encounter a very bad signal-to-noise problem. For this choice of utilities,
the effects of the actions taken by node $\eta$ on its utility may be "swamped" and effectively
invisible, since there are so many other processes going into determining G's value. This
makes it hard for $\eta$ to discern the echo of its actions and learn how to improve its private
utility. It also means that $\eta$ will find it difficult to decide how best to act once learning
has completed, since so much of what's important to $\eta$ is decided by processes outside
of $\eta$'s immediate purview. In such a scenario, there is nothing that $\eta$'s microlearner can
do to reliably achieve high intelligence.

In addition to this "observation-driven" signal/noise problem, there is an "action-driven"
one. For reasons discussed in Observation 7.1 above, we can define a distribution
$d\mu(\zeta'_{\hat\eta,0})$ reflecting what $\eta$ does/doesn't know concerning the actual state of the outside
world $\hat\eta$ at time 0. If the node $\eta$ chooses its actions in a Bayes-optimal manner, then

$\zeta_{\eta,0;act} = \mathrm{argmax}_z \left[ \int d\mu(\zeta'_{\hat\eta,0}) \, U(\zeta_{,t<0} \bullet C(z \bullet \zeta_{\eta,0;\neg act} \bullet \zeta'_{\hat\eta,0})) \right]$, where z runs over the

allowed action components of $\eta$ at time 0. Since this will differ from
$\mathrm{argmax}_z [U(\zeta_{,t<0} \bullet C(z \bullet \zeta_{\eta,0;\neg act} \bullet \zeta_{\hat\eta,0}))]$ in general, this Bayes-optimal node's intelligence will be less than
1 for the particular $\zeta$ at hand, in general. Moreover, the less U's ultimate value (after
the application of C, etc.) depends on $\zeta_{\hat\eta,0}$, the smaller the difference in these two
argmax-based z's, and therefore the higher the intelligence of $\eta$, in general.

^^This "signal-to-noise" problem is actually endemic to reinforcement learning as a whole, even some- 
times occurring when one has just a single reinforcement learner, and only a few random variables jointly 
determining the value of the rewards [287].

^^In practice, due to computational limitations if nothing else, the node won't be exactly Bayes- 
optimal. But incorporating such a suboptimality doesn't affect the thrust of this argument that we want 






We would like a measure of U that captures these effects, but without depending
on function maximization or any other detailed aspects of how the node determines its
actions. One natural way to do this is via the (utility) learnability: Given a measure
$d\mu(\zeta'_{,0})$ restricted to a manifold C, the (t = 0) utility learnability of a utility U for a
node $\eta$ at $\zeta$ is:

Intelligence learnability is defined the same way, with $U(.)$ replaced by $\epsilon_{\eta,U}(.)$. Note
that any affine transformation of U has no effect on either the utility learnability $\Lambda_{\eta,U}(\zeta)$
or the associated intelligence learnability, $\Lambda_{\eta,\epsilon_{\eta,U}}(\zeta)$.

The integrand in the numerator of the definition of learnability reflects how much
of the change in U that results from replacing $\zeta_{,0}$ with $\zeta'_{,0}$ is due to the change in $\eta$'s
t = 0 state (the "signal"). The denominator reflects how much of the change in U that
results from replacing $\zeta_{,0}$ with $\zeta'_{,0}$ is due to the change in the t = 0 states of nodes other
than $\eta$ (the "noise"). So learnability quantifies how easy it is for the microlearner to
discern the "echo" of its behavior in the utility function U. Our presumption is that the 
microlearning algorithm will achieve higher intelligence if provided with a more learnable 
private utility. 
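
The signal-to-noise reading of learnability can be explored with a small Monte Carlo sketch.
The code below is not the paper's definition: it assumes a discrete analogue in which the
"signal" is the average absolute change in U when only the node's own time-0 component is
resampled, and the "noise" is the average absolute change when only the other nodes' components
are resampled. The uniform sampler, the additive world utility, and all function names are
hypothetical.

```python
import random


def learnability(utility, state, node, sample_state, n_samples=2000, seed=0):
    """Ratio of average |change in U| from resampling the node's own component
    (signal) to that from resampling every other component (noise)."""
    rng = random.Random(seed)
    base = utility(state)
    signal = 0.0
    noise = 0.0
    for _ in range(n_samples):
        alt = sample_state(rng)
        # Replace only the node's own component.
        only_node = list(state)
        only_node[node] = alt[node]
        # Replace every component except the node's own.
        only_others = list(alt)
        only_others[node] = state[node]
        signal += abs(utility(only_node) - base)
        noise += abs(utility(only_others) - base)
    return signal / noise if noise > 0 else float("inf")


def world_utility(state):
    # Hypothetical team utility: every node is rewarded with the global sum.
    return sum(state)


def local_utility(state):
    # Hypothetical utility depending only on node 0's own component.
    return state[0]


def uniform_sampler(n):
    return lambda rng: [rng.random() for _ in range(n)]


small = learnability(world_utility, [0.5] * 10, 0, uniform_sampler(10))
large = learnability(world_utility, [0.5] * 100, 0, uniform_sampler(100))
local = learnability(local_utility, [0.5] * 100, 0, uniform_sampler(100))
print(small, large, local)
```

Under these assumptions the team utility becomes less learnable as the number of nodes grows,
while a utility that depends only on the node's own component has no noise term at all.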

Intuitively, the (utility) differential learnability of U at a point $\zeta$ is the learnability
with $d\mu$ restricted to an infinitesimal ball about $\zeta$. We formalize it as the following ratio
of magnitudes of a pair of gradients, one involving $\eta$, and one involving $\hat\eta$:

$\lambda_{\eta,U}(\zeta) \equiv \frac{\| \vec{\nabla}_{\zeta_{\eta,0}} U(\zeta_{,t<0} \bullet C(\zeta_{,0})) \|}{\| \vec{\nabla}_{\zeta_{\hat\eta,0}} U(\zeta_{,t<0} \bullet C(\zeta_{,0})) \|}$    (3)

Note that a particular value of differential utility learnability, by itself, has no significance.
Simply rescaling the units of $\zeta_{\eta,0}$ will change that value. Rather what is
important is the ratio of differential learnabilities, at the same $\zeta$, for different U's. Such
a ratio quantifies the relative preferability of those U's.
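
A finite-difference sketch can make the gradient ratio of Eq. 3 concrete. It assumes that the
time-0 state is a flat list of real numbers, that the dynamics have already been folded into the
utility, and that central differences are an adequate stand-in for the gradients. The two
candidate utilities being compared are hypothetical.

```python
import math


def partials(f, x, indices, h=1e-6):
    """Central-difference partial derivatives of f at x along the given indices."""
    result = []
    for i in indices:
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        result.append((f(up) - f(down)) / (2 * h))
    return result


def norm(values):
    return math.sqrt(sum(v * v for v in values))


def differential_learnability(f, x, node_indices):
    """Gradient magnitude with respect to the node's own components divided by
    the gradient magnitude with respect to all other components."""
    others = [i for i in range(len(x)) if i not in node_indices]
    signal = norm(partials(f, x, node_indices))
    noise = norm(partials(f, x, others))
    return signal / noise if noise > 0 else float("inf")


def team_utility(x):
    # Every component matters equally.
    return sum(x)


def damped_utility(x):
    # Node 0's component at full weight, all others down-weighted.
    return x[0] + 0.1 * sum(x[1:])


point = [0.3, 0.7, 0.1, 0.9, 0.5]
lam_team = differential_learnability(team_utility, point, [0])
lam_damped = differential_learnability(damped_utility, point, [0])
print(lam_team, lam_damped, lam_damped / lam_team)
```

As the text notes, the individual values carry no significance on their own; the comparison of
interest is the ratio between the two utilities at the same point.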

One nice feature of differential learnability is that unlike learnability, it does not
depend on choice of some measure $d\mu(.)$. This independence can lead to trouble if one
is not careful however, and in particular if one uses learnability for purposes other than
choosing between utility functions. For example, in some situations, the COIN designer
will have the option of enlarging the set of variables from the rest of the COIN that are
"input" to some node $\eta$ at t = 0 and that therefore can be used by $\eta$ to decide what
action to take. Intuitively, doing so will not affect the RL "signal" for $\eta$'s microlearner
(the magnitude of the potential "echo" of $\eta$'s actions is not modified by changing some
aspect of how it chooses among those actions). However it will reduce the "noise", in
that $\eta$'s microlearner now knows more about the state of the rest of the system.

U's ultimate value to not depend on $\zeta_{\hat\eta,0}$.

In the full integral version of learnability, this effect can be captured by having the
support of $d\mu(.)$ restricted to reflect the fact that the extra inputs to $\eta$ at t = 0 are
correlated with the t = 0 state of the external system. In differential learnability however
this is not possible, precisely because no measure $d\mu(.)$ occurs in its definition. So we
must capture the reduction in noise in some other fashion.

Alternatively, if the extra variables are being input to $\eta$ for all $t \ge 0$, not just at
t = 0, and if $\eta$ "pays attention" to those variables for all $t \ge 0$, then by incorporating
those changes into our system, C itself has changed, $\forall t \ge 0$. Hypothesize that at
those t the node $\eta$ is capable of modifying its actions to "compensate" for what (due
to our augmentation of $\eta$'s inputs) $\eta$ now knows to be going on outside of it. Under
this hypothesis, those changes in those external events will have less of an effect on
the ultimate value of $g_\eta$ than they would if we had not made our modification. In this
situation, the noise term has been reduced, so that the differential learnability properly
captures the effect of $\eta$'s having more inputs.

Another potential danger to bear in mind concerning differential learnability is that 
it is usually best to consider its average over a region, in particular over points with 
less than maximal intelligence. It is really designed for such points; in fact, at the
intelligence-maximizing $\zeta$, $\lambda_{\eta,U}(\zeta) = 0$.

Whether in its differential form or not, and whether referring to utilities or intel- 
ligence, learnability is not meant to capture all factors that will affect how high an 
intelligence value a particular microlearner will achieve. Such an all-inclusive definition 
is not possible, if for no other reason than the fact that there are many such factors that
are idiosyncratic to the particular microlearner used. Beyond this though, certain more 
general factors that affect most popular learning algorithms, like the curse of dimension- 
ality, are also not (explicitly) designed into learnability. Learnability is not meant to 
provide a full characterization of performance — that is what intelligence is designed to 
do. Rather (relative) learnability is only meant to provide a guide for how to improve
performance. 

A system that has infinite (differential, intelligence) learnability for all its personal
utilities is said to be "perfectly" (differential, intelligence) learnable. It is
straightforward to prove that a system is perfectly learnable $\forall \zeta \in C$ iff $\forall \eta, \forall \zeta \in C$, $g_\eta(\zeta)$ can

^^One way to capture this noise reduction is to replace the noise term 9^ U {(, *C{(^ ^)) occurring 
in the definition of differential learnability with something more nuanced. For example, one may wish 
to replace it with the maximum of the dot product of u = 9j U(Q with any Z vector v, subject not only 
to the restrictions that — 1 and g = 0, but also subject to the restriction that v must lie in the 
tangent plane of C at (. The first two restrictions, in concert with the extra restriction that v = 0, 
give the original definition of the noise term. If they are instead joined with the third, new restriction, 
they will enforce any applicable coupling between the state of rj at time and the rest of the system at 
time 0. Solving with Lagrange multipliers, we get vocu — A2Q — A3/3, where a is the normal to C at ^, 
^, = Srji^ri 5t',t, and A2 = ~'' while A3 = ~'' ■ As a practical matter though, it is 

often simplest to assume that the ^ can vary arbitrarily, independent of ^ 0' that the noise term 
takes the form in Eq. 3.

be written as $\psi_\eta(\zeta_{\eta,0})$ for some function $\psi_\eta(.)$. (See the discussion below on the general
condition for a system's being perfectly factored.)

4.3 A descriptive framework for COINs 

With these definitions in hand, we can now present (a portion of) one descriptive frame- 
work for COINs. In this subsection, after discussing salient characteristics in general, 
we present some theorems concerning the relationship between personal utilities and the 
salient characteristic we choose to concentrate on. We then discuss how to use these
theorems to induce that salient characteristic in a COIN. 

4.3.1 Candidate salient characteristics of a COIN 

The starting point with a descriptive framework is the identification of "salient charac- 
teristics of a COIN which one strongly expects to be associated with its having large 
world utility" . In this chapter we will focus on salient characteristics that concern the re- 
lationship between personal and world utilities. These characteristics are formalizations 
of the intuition that we want COINs in which the competent greedy pursuit of their pri- 
vate utilities by the microlearners results in large world utility, without any bottlenecks, 
TOC, "frustration" (in the spin glass sense) or the like. 

One natural candidate for such a characteristic, related to Pareto optimality [|10[)|, 101],
is weak triviality. It is defined by considering any two worldlines $\zeta$ and $\zeta'$
both of which are consistent with the system's dynamics (i.e., both of which lie on C),
where for every node $\eta$, $g_\eta(\zeta) \ge g_\eta(\zeta')$. If for any such pair of worldlines where one
"Pareto dominates" the other it is necessarily true that $G(\zeta) \ge G(\zeta')$, we say that the
system is weakly trivial. We might expect that systems that are weakly trivial for the
microlearners' private utilities are configured correctly for inducing large world utility.
After all, for such systems, if the microlearners collectively change $\zeta$ in a way that ends
up helping all of them, then necessarily the world utility also rises. More formally, for
a weakly trivial system, the maxima of G are Pareto-optimal points for the personal
utilities (although the reverse need not be true).

^*An obvious variant is to restrict $\zeta'_{,t<0} = \zeta_{,t<0}$, and require only that both of the "partial vectors"
$\zeta'_{,t \ge 0}$ and $\zeta_{,t \ge 0}$ obey the relevant dynamical laws, and therefore lie in $C_{,t \ge 0}$.

As it turns out though, weakly trivial systems can readily evolve to a world utility
minimum, one that often involves TOC. To see this, consider automobile traffic in the
absence of any traffic control system. Let each node be a different driver, and say their
private utilities are how quickly they each individually get to their destination. Identify
world utility as the sum of private utilities. Then by simple additivity, for all $\zeta$ and $\zeta'$,
whether they lie on C or not, if $g_\eta(\zeta) \ge g_\eta(\zeta')$ for every $\eta$ it follows that $G(\zeta) \ge G(\zeta')$; the
system is weakly trivial. However as any driver on a rush-hour freeway with no carpool
lanes or metering lights can attest, every driver's pursuing their own goal definitely does
not result in acceptable throughput for the system as a whole; modifications to private
utility functions (like fines for violating carpool lanes or metering lights) would result
in far better global behavior. A system's being weakly trivial provides no assurances
regarding world utility.
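
The traffic argument can be made concrete with a deliberately simple model. The sketch below
assumes two routes: a freeway whose travel time equals the fraction of drivers using it, and a
side road that always takes one unit of time. Each driver's private utility is the negative of
its own travel time, and world utility is the sum of private utilities, so the system is weakly
trivial by additivity. The model and all names in it are assumptions made for the example.

```python
def travel_times(n_freeway, n_drivers):
    """Hypothetical two-route model: freeway time grows with its load,
    the side road always takes 1.0."""
    return n_freeway / n_drivers, 1.0


def world_utility(n_freeway, n_drivers):
    """Sum of private utilities, each being the negative travel time."""
    freeway_time, side_time = travel_times(n_freeway, n_drivers)
    return -(n_freeway * freeway_time + (n_drivers - n_freeway) * side_time)


def greedy_equilibrium(n_drivers):
    """Drivers move to the freeway one at a time for as long as doing so
    strictly shortens the mover's own trip."""
    n_freeway = 0
    while n_freeway < n_drivers and (n_freeway + 1) / n_drivers < 1.0:
        n_freeway += 1
    return n_freeway


N = 100
greedy = greedy_equilibrium(N)
best = max(range(N + 1), key=lambda k: world_utility(k, N))

print("greedy split:", greedy, "world utility:", world_utility(greedy, N))
print("best split:  ", best, "world utility:", world_utility(best, N))
```

In this model greedy drivers crowd onto the freeway until it is barely faster than the side road,
whereas world utility is maximized by an even split. No single driver at the greedy outcome can
gain by switching, so nothing in the private utilities pushes the system toward the better
arrangement.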

This does not mean weak triviality is never of use. For example, say that for a set 
of weakly trivial personal utilities each agent can guarantee that regardless of what the 
other agents do, its utility is above a certain level. Assume further that, being risk-averse, 
each agent chooses an action with such a guarantee. Say it is also true that the agents 
are provided with a relatively large set of candidate guaranteed values of their utilities. 
Under these circumstances, the system's being weakly trivial provides some assurances 
that world utility is not too low. Moreover, if the overhead in enforcing such a future- 
guaranteeing scheme is small, and having a sizable set of guaranteed candidate actions 
provided to each of the agents does not require an excessively centralized infrastructure, 
we can actually employ this kind of scheme in practice. Indeed, in the extreme case, 
one can imagine that every agent is guaranteed exactly what its utility would be for 
every one of its candidate actions. (See the discussion on General Equilibrium in the 
Background Section above.) In this situation, Nash equilibria and Pareto optimal points 
are identical, which due to weak triviality means that the point maximizing G is a Nash 
equilibrium. However in any less extreme situation, the system may not achieve a value 
of world utility that is close to optimal. This is because even for weakly trivial systems 
a Pareto optimal point may have poor world utility, in general. 

Situations where one has guarantees of lower bounds on one's utility are not too common,
but they do arise. One important example is a round of trades in a computational
market (see the Background Section above). In that scenario, there is an agent-indexed
set of functions $\{f_\eta(z \in Z)\}$ and the personal utility of each agent $\eta \in \{1, 2, ...\}$ is
given by $f_\eta(\zeta_{,t^*})$, where $t^*$ is the end of the round of trades. There is also a function
$F(z \in Z) \equiv F(f_1(z), f_2(z), ...)$ that is a monotonically increasing function of its arguments,
and world utility G is given by $F(\zeta_{,t^*}) = F(f_1(\zeta_{,t^*}), f_2(\zeta_{,t^*}), ...)$. This system
is weakly trivial. In turn, each $f_\eta(z)$ is determined solely by the "allotment of goods"
possessed by $\eta$, as specified in the appropriate components of $z_\eta$. To be able to remove
uncertainty about its future value of $f_\eta$ in this kind of system, in determining its trading
actions each agent $\eta$ must employ some scheme like inter-agent contracts. This is
because without such a scheme, no agent can be assured that if it agrees to a proposed
trade with another agent, the full proposed transaction of that trade actually occurs.
Given such a scheme, if in each trade round t each agent $\eta$ myopically only considers
those trades that are assured of increasing the corresponding value of $f_\eta$, then we are
guaranteed that the value of the world utility is not less than the initial value $F(\zeta_{,0})$.

The problem with using weak triviality as a general salient characteristic is precisely 
the fact that the individual microlearners are greedy. In a COIN, there is no system- 
wide incentive to replace $\zeta$ with a different worldline that would improve everybody's
private utility, as in the definition of weak triviality. Rather the incentives apply to each
microlearner individually and motivate the learners to behave in a way that may well
hurt some of them. So weak triviality is, upon examination, a poor choice for the salient 
characteristic of a COIN. 

One alternative to weak triviality follows from consideration of the stricture that we 
must 'expect' a salient characteristic to be coupled to large world utility in a running 
real-world COIN. What can we reasonably expect about a running real-world COIN? We
cannot assume that all the private utilities will have large values (witness the traffic
example). But we can assume that if the microlearners are well-designed, each of them
will be doing close to as well as it can given the behavior of the other nodes. In other words,
within broad limits we can assume that the system is more likely to be in $\zeta$ than $\zeta'$ if for
all $\eta$, $\epsilon_{\eta,g_\eta}(\zeta) > \epsilon_{\eta,g_\eta}(\zeta')$. We define a system to be coordinated iff for any
such $\zeta$ and $\zeta'$ lying on $C$, $G(\zeta) \geq G(\zeta')$. (Again, an obvious variant is to restrict
$\zeta'_{,t<0} = \zeta_{,t<0}$, and require only that both $\zeta_{,t \geq 0}$ and $\zeta'_{,t \geq 0}$ lie on
$C_{,t \geq 0}$.) Traffic systems are not coordinated,
in general. This is evident from the simple fact that if all drivers acted as though there 
were metering lights when in fact there weren't any, they would each be behaving with 
lower intelligence given the actions of the other drivers (each driver would benefit greatly 
by changing its behavior by no longer pretending there were metering lights, etc.). But 
nonetheless, world utility would be higher. 

4.3.2 The Salient Characteristic of Factoredness 

Like weak triviality, coordination is intimately related to the economics concept of Pareto 
optimality. Unfortunately, there is not room in this chapter to present the mathematics 
associated with coordination and its variants. We will instead discuss a third candidate 
salient characteristic of COINs, one which like coordination (and unlike weak triviality) 
we can reasonably expect to be associated with large world utility. This alternative fixes 
weak triviality not by replacing the personal utilities $\{g_\eta\}$ with the intelligences
$\{\epsilon_{\eta,g_\eta}\}$ as coordination does, but rather by only considering worldlines whose
difference at time 0 involves a single node. This results in this alternative's being related to Nash equilibria
rather than Pareto optimality. 

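Say that our COIN's worldline is $\zeta$. Let $\zeta'$ be any other worldline where $\zeta_{,t<0} = \zeta'_{,t<0}$,
and where $\zeta'_{,t \geq 0} \in C_{,t \geq 0}$. Now restrict attention to those $\zeta'$ where at $t = 0$,
$\zeta$ and $\zeta'$ differ only for node $\eta$. If for all such $\zeta'$

$$\mathrm{sgn}[g_\eta(\zeta) - g_\eta(\zeta_{,t<0} \bullet C(\zeta'_{,0}))] = \mathrm{sgn}[G(\zeta) - G(\zeta_{,t<0} \bullet C(\zeta'_{,0}))] , \qquad (4)$$

and if this is true for all nodes $\eta$, then we say that the COIN is factored for all those
utilities $\{g_\eta\}$ (at $\zeta$, with respect to time 0 and the utility $G$).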

For a factored system, for any node $\eta$, given the rest of the system, if the node's state
at $t = 0$ changes in a way that improves that node's utility over the rest of time, then
it necessarily also improves world utility. Colloquially, for a system that is factored for
a particular microlearner's private utility, if that learner does something that improves
that personal utility, then everything else being equal, it has also done something that
improves world utility. Of two potential microlearners for controlling node $\eta$ (i.e., two
potential $\zeta_{\eta,}$) whose behavior until $t = 0$ is identical but which differ there, the
microlearner that is smarter with respect to $g_\eta$ will always result in a larger $g_\eta$, by
definition of intelligence. Accordingly, for a factored system, the smarter microlearner is
also the one that results in better $G$. So as long as we have deployed a sufficiently smart
microlearner on $\eta$, we have assured a good $G$ (given the rest of the system). Formally, this
is expressed in the fact [283] that for a factored system, for all nodes $\eta$,

$$\epsilon_{\eta,g_\eta}(\zeta) = \epsilon_{\eta,G}(\zeta) . \qquad (5)$$

One can also prove that Nash equilibria of a factored system are local maxima of world 
utility. Note that in keeping with our behaviorist perspective, nothing in the definition 
of factored requires the existence of private utilities. Indeed, it may well be that a system 
having private utilities $\{U_\eta\}$ is factored, but for personal utilities $\{g_\eta\}$ that differ from
the $\{U_\eta\}$.
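The sign condition of Eq. 4 can be checked numerically on a toy system. The following is a minimal sketch, not part of the formalism above: it assumes a single time step (so the worldline is just the joint time-0 state), scalar node states, and a finite list of alternative states for each node. The names `is_factored_at`, `world_utility`, `team_ok`, `monotone_ok`, and `selfish_ok` are hypothetical.

```python
import math

def sign(x, tol=1e-12):
    """Return -1, 0 or +1, treating |x| <= tol as zero."""
    return (x > tol) - (x < -tol)

def is_factored_at(G, personal_utilities, state, candidates):
    """Check the sign condition of Eq. 4 at `state` for every node.

    G                  : world utility, a function of the joint state (a tuple)
    personal_utilities : one function per node, each taking the joint state
    candidates         : alternative time-0 states tried for one node at a time
    """
    for eta, g in enumerate(personal_utilities):
        for alt in candidates:
            # Change only node eta's state; everything else stays fixed.
            other = state[:eta] + (alt,) + state[eta + 1:]
            if sign(g(state) - g(other)) != sign(G(state) - G(other)):
                return False
    return True

# Toy world utility (an assumption): best when the node states sum to 3.
def world_utility(z):
    return -(sum(z) - 3.0) ** 2

state = (1.0, 1.0, 0.5)
candidates = [0.0, 0.5, 1.0, 2.0]

# Team game: every personal utility equals G.
team_ok = is_factored_at(world_utility, [world_utility] * 3, state, candidates)

# A monotonically increasing function of G as personal utility.
monotone_ok = is_factored_at(
    world_utility, [lambda z: math.exp(world_utility(z))] * 3, state, candidates)

# "Selfish" utilities: each node just wants its own state to be large.
selfish = [lambda z, i=i: z[i] for i in range(3)]
selfish_ok = is_factored_at(world_utility, selfish, state, candidates)
```

In this toy setting the team-game and monotone-transform utilities pass the check, while the selfish utilities fail it, since a node can raise its own state without raising `world_utility`.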

A system's being factored does not mean that a change to $\zeta_{\eta,0}$ that improves $g_\eta(\zeta)$
cannot also hurt $g_{\eta'}(\zeta)$ for some $\eta' \neq \eta$. Intuitively, for a factored system, the side
effects on the rest of the system of $\eta$'s increasing its own utility do not end up decreasing
world utility, but they can have arbitrarily adverse effects on other private utilities. (In
the language of economics, no stipulation is made that $\eta$'s "costs are endogenized.") For
factored systems, the separate microlearners successfully pursuing their separate goals
do not frustrate each other as far as world utility is concerned.

In addition, if $g_{\eta,t'}$ is factored with respect to $G$, then a change to $\zeta_{\eta,t'}$ that improves
$g_{\eta,t'}(\zeta_{,t<t'} \bullet C(\zeta_{,t'}))$ improves $G(\zeta_{,t<t'} \bullet C(\zeta_{,t'}))$. But it may hurt some
$g_{\eta,t'' \neq t'}(\zeta_{,t<t''} \bullet C(\zeta_{,t''}))$ and/or
$\epsilon_{(\eta,t''),g_{\eta,t''}}(\zeta_{,t<t''} \bullet C(\zeta_{,t''}))$. (This is even true for a discounted sum of
rewards personal utility, so long as $t'' > t'$.) An example of this would be an economic system cast as
a single individual, $\eta$, together with an environment node, where $G$ is a steeply discounted
sum of rewards $\eta$ receives over his/her lifetime, $t'' > t'$, and $\forall t$, $g_{\eta,t}(\zeta) = G(\mathrm{CL}_{,t'<t}(\zeta))$.
For such a situation, it may be appropriate for $\eta$ to live extravagantly at the time $t'$, and
"pay for it" later.

As an instructive example of the ramifications of Eq. 5, say node $\eta$ is a conventional
computer. We want $\epsilon_{\eta,G}(\zeta)$ to be as high as possible, i.e., given the state of the rest of
the system at time 0, we want computer $\eta$'s state then to be the best possible, as far as
the resultant value of $G$ is concerned. Now a computer's "state" consists of the values
of all its bits, including its code segment, i.e., including the program it is running. So
for a factored personal utility $g_\eta$, if the program running on the computer is better than
most others as far as $g_\eta$ is concerned, then it is also better than most other programs as
far as $G$ is concerned.

Our task as COIN designers engaged in COIN initialization or macrolearning is to find
such a program and such an associated $g_\eta$. One way to approach this task is to restrict
attention to programs that consist of RL algorithms with private utility specified in the
bits $\{b_i\}$ of $\eta$. This reduces the task to one of finding a private utility $\{b_i\}$ (and thereby
fully specifying $\zeta_{\eta,0}$) such that our RL algorithm working with that private utility has
high $\epsilon_{\eta,g_\eta}$, i.e., such that that algorithm outperforms most other programs as far as the
personal utility $g_\eta$ is concerned.

Perhaps the simplest way to address this reduced task is to exploit the fact that for a
good enough RL algorithm $\epsilon_{\eta,\{b_i\}}$ will be large, and therefore adopt such an RL algorithm
and fix the private utility to equal $g_\eta$. In this way we further reduce the original task,
which was to search over all personal utilities $g_\eta$ and all programs $R$ to find a pair such
that both $g_\eta$ is factored with respect to $G$ and there are relatively few programs that
outperform $R$, as far as $g_\eta$ is concerned. The task is now instead to search over all private
utilities $\{b_i\}$ such that both $\{b_i\}$ is factored with respect to $G$ and such that there are few
programs (of any sort, RL-based or not) that outperform our RL algorithm working on $\{b_i\}$, as far
as that self-same private utility is concerned. The crucial assumption being leveraged
in this approach is that our RL algorithm is "good enough", and the reason we want
learnable $\{b_i\}$ is to help effect this assumption.

In general though, we can't have both perfect learnability and perfect factoredness.
As an example, say that $\forall t$, $Z_{\eta,t} = Z_{\hat{\eta},t} = \mathbb{R}$, and that the dynamics is the identity
operator: $\forall t$, $(C(\zeta_{,0}))_{,t} = \zeta_{,0}$. Then if $G(\zeta) = \zeta_{\eta,0} \cdot \zeta_{\hat{\eta},0}$ and the system is
perfectly learnable, it is not perfectly factored. This is because perfect learnability requires that
$\forall \zeta \in C$, $g_\eta(\zeta) = \psi_\eta(\zeta_{\eta,0})$ for some function $\psi_\eta(\cdot)$. However any change to
$\zeta_{\eta,0}$ that improves such a $g_\eta$ will either help or hurt $G(\zeta)$, depending on the sign of
$\zeta_{\hat{\eta},0}$. For the "wrong" sign of $\zeta_{\hat{\eta},0}$, this means the system is actually
"anti-factored". Due to such incompatibility between perfect factoredness and perfect learnability,
we must usually be content with having a high degree of factoredness and high learnability. In such
situations, the emphasis of the macrolearning process should be more and more on having a high
degree of factoredness as we get closer and closer to a Nash equilibrium. This way the
system won't relax to an incorrect local maximum.
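The product example above is small enough to check directly. The sketch below assumes scalar states, the identity dynamics of the example, and the particular choice $\psi_\eta(x) = x$ for the perfectly learnable personal utility; the names `G`, `psi`, and `agreement` are hypothetical.

```python
def G(x, y):
    """World utility of the example: product of eta's state x and the other node's state y."""
    return x * y

def psi(x):
    """A perfectly learnable personal utility: depends on eta's own state only (assumed form)."""
    return x

def sign(v):
    return (v > 0) - (v < 0)

def agreement(x_old, x_new, y):
    """+1 if the change in psi and the change in G have the same sign,
    -1 if they have opposite signs, 0 if either is unchanged."""
    return sign(psi(x_new) - psi(x_old)) * sign(G(x_new, y) - G(x_old, y))

same_sign = agreement(1.0, 2.0, 3.0)    # other node's state positive
anti_sign = agreement(1.0, 2.0, -3.0)   # other node's state negative
```

With a positive state for the other node, improving `psi` also improves `G`; with a negative one, the same move lowers `G`, which is the "anti-factored" case described in the text.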

In practice of course, a COIN will often not be perfectly factored. Nor in practice 
are we always interested only in whether the system is factored at one particular point 
(rather than across a region say). These issues are discussed in [283], where in particular
a formal definition of the degree of factoredness of a system is presented.

If a system is factored for utilities $\{g_\eta\}$, then it is also factored for any utilities $\{g'_\eta\}$
where for each $\eta$, $g'_\eta$ is a monotonically increasing function of $g_\eta$. More generally, the
following result characterizes the set of all factored personal utilities:

Theorem 1: A system is factored at all $\zeta \in C$ iff for all those $\zeta$, $\forall \eta$, we can write

$$g_\eta(\zeta) = \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G(\zeta))$$

for some function $\Phi_\eta(\cdot, \cdot, \cdot)$ such that $\partial_G \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G) > 0$ for all
$\zeta \in C$ and associated $G$ values. (The form of the $\{g_\eta\}$ off of $C$ is arbitrary.)

Proof: For fixed $\zeta_{,t<0}$ and $\zeta_{\hat{\eta},0}$, any change to $\zeta_{\eta,0}$ which keeps $\zeta$ on $C$ and which at
the same time increases $G(\zeta) = G(\zeta_{,t<0} \bullet C(\zeta_{\hat{\eta},0} \bullet \zeta_{\eta,0}))$ must increase
$\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G(\zeta))$, due to the restriction on $\partial_G \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G)$.
This establishes the backwards direction of the proof.

For the forward direction, write
$g_\eta(\zeta) = g_\eta(\zeta, G(\zeta)) = g_\eta(\zeta_{,t<0} \bullet C(\zeta_{\hat{\eta},0} \bullet \zeta_{\eta,0}), G(\zeta))$ $\forall \zeta \in C$.
Define this formulation of $g_\eta$ as $\Phi_\eta(\zeta_{,t<0}, \zeta_{,0}, G(\zeta))$, which we can re-express as
$\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, \zeta_{\eta,0}, G(\zeta))$. Now since the system is factored, $\forall \zeta \in C$,
$\forall \zeta'_{,t \geq 0} \in C_{,t \geq 0}$,



So consider any situation where the system is factored, and the values of $G$, $\zeta_{,t<0}$, and
$\zeta_{\hat{\eta},0}$ are specified. Then we can find any $\zeta_{\eta,0}$ consistent with those values (i.e., such that
our provided value of $G$ equals $G(\zeta_{,t<0} \bullet C(\zeta_{\hat{\eta},0} \bullet \zeta_{\eta,0}))$), evaluate the resulting value
of $\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, \zeta_{\eta,0}, G)$, and know that we would have gotten the same value if we
had found a different consistent $\zeta_{\eta,0}$. This is true for all $\zeta \in C$. Therefore the mapping
$(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G) \rightarrow \Phi_\eta$ is single-valued, and we can write
$\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G(\zeta))$. QED.

By Thm. 1, we can ensure that the system is factored without any concern for
$C$, by having each $g_\eta(\zeta) = \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G(\zeta))$ $\forall \zeta \in Z$. Alternatively, by only
requiring that $\forall \zeta \in C$ does $g_\eta(\zeta) = \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G(\zeta))$ (i.e., does
$g_\eta(\zeta_{,t<0} \bullet C(\zeta_{,0})) = \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G(\zeta_{,t<0} \bullet C(\zeta_{,0})))$), we can access
a broader class of factored utilities, a class that does depend on $C$. Loosely speaking, for those
utilities, we only need the projection of $\partial_\zeta G(\zeta)$ onto $C_{\eta,0}$ to be parallel to the
projection of $\partial_\zeta g_\eta(\zeta)$ onto $C_{\eta,0}$. Given $G$ and $C$, there are infinitely many
$\partial_\zeta g_\eta(\zeta)$ having this projection (the set of such $\partial_\zeta g_\eta(\zeta)$ form a linear
subspace of $Z$). The partial differential equations expressing the precise relationship are
discussed in [283].

As an example of the foregoing, consider a 'team game' (also known as an 'exact
potential game' [183], [193]) in which $g_\eta = G$ for all $\eta$. Such COINs are factored, trivially,
regardless of $C$; if $g_\eta$ rises, then $G$ must as well, by definition. (Alternatively, to confirm
that team games are factored just take $\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G) = G$ $\forall \eta$ in Thm. 1.) On the
other hand, as discussed below, COINs with 'wonderful life' personal utilities are also
factored, but the definition of such utilities depends on $G$.

4.3.3 Wonderful life utility 

Due to their often having poor learnability and requiring centralized communication
(among other infelicities), in practice team game utilities often are poor choices for
personal utilities. Accordingly, it is often preferable to use some other set of factored
utilities. To present an important example, first define the ($t = 0$) effect set of node $\eta$
at $\zeta$, $C^{\mathrm{eff}}_\eta(\zeta)$, as the set of all components $\zeta_{\eta',t}$ for which
$\partial_{\zeta_{\eta,0}} (C(\zeta_{,0}))_{\eta',t} \neq \vec{0}$. Define the effect set $C^{\mathrm{eff}}_\eta$ with no
specification of $\zeta$ as $\cup_{\zeta \in C} C^{\mathrm{eff}}_\eta(\zeta)$. (We take this latter definition
to be the default meaning of "effect set".) We will also find it useful to define
$\hat{C}^{\mathrm{eff}}_\eta$ as the set of components of the space $Z$ that are not in $C^{\mathrm{eff}}_\eta$.

Intuitively, $\eta$'s effect set is the set of all components $\zeta_{\eta',t}$ which would be affected
by a change in the state of node $\eta$ at time 0. (They may or may not be affected by
changes in the $t = 0$ states of the other nodes.) Note that the effect sets of different
nodes may overlap. The extension of the definition of effect sets for times other than 0
is immediate. So is the modification to have effect sets only consist of those components
$\zeta_{\eta',t;i}$ that vary with the state of node $\eta$ at time 0, rather than consist of the full
vectors $\zeta_{\eta',t}$ possessing such a component. These modifications will be skipped here, to
minimize the number of variables we must keep track of.

Next for any set $\sigma$ of components $(\eta', t)$, define $\mathrm{CL}_\sigma(\zeta)$ as the "virtual" vector formed
by clamping the $\sigma$-components of $\zeta$ to an arbitrary fixed value. (In this paper, we
take that fixed value to be $\vec{0}$ for all components listed in $\sigma$.) Consider in particular a
Wonderful Life set $\sigma$. The value of the wonderful life utility (WLU for short) for $\sigma$
at $\zeta$ is defined as:

$$\mathrm{WLU}_\sigma(\zeta) \equiv G(\zeta) - G(\mathrm{CL}_\sigma(\zeta)) . \qquad (7)$$

In particular, the WLU for the effect set of node $\eta$ is $G(\zeta) - G(\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta))$, which for
$\zeta \in C$ can be written as
$G(\zeta_{,t<0} \bullet C(\zeta_{,0})) - G(\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0} \bullet C(\zeta_{,0})))$.

We can view $\eta$'s effect set WLU as analogous to the change in world utility that would
have arisen if node $\eta$ "had never existed". (Hence the name of this utility; cf. the Frank
Capra movie.) Note however, that CL is a purely "fictional", counter-factual operation,
in the sense that it produces a new $\zeta$ without taking into account the system's dynamics.
Indeed, no assumption is even being made that $\mathrm{CL}_\sigma(\zeta)$ is consistent with the dynamics
of the system. The sequence of states the node $\eta$ is clamped to in the definition of the
WLU need not be consistent with the dynamical laws embodied in $C$.

This dynamics-independence is a crucial strength of the WLU. It means that to
evaluate the WLU we do not try to infer how the system would have evolved if node $\eta$'s
state were set to $\vec{0}$ at time 0 and the system evolved from there. So long as we know $\zeta$
extending over all time, and so long as we know $G$, we know the value of WLU. This is
true even if we know nothing of the dynamics of the system.
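The clamping operation and Eq. 7 translate almost directly into code. The following is a minimal sketch under stated assumptions: the worldline is stored as a NumPy array indexed by (node, time) with scalar states, the clamp value is 0, and `toy_G` is an arbitrary stand-in for a world utility. The names `clamp`, `wlu`, and `toy_G` are hypothetical. Note that nothing in it simulates the dynamics; only the recorded worldline and $G$ are used.

```python
import numpy as np

def clamp(zeta, sigma, value=0.0):
    """Return a copy of the worldline with the (node, time) components in sigma set to `value`."""
    out = zeta.copy()
    for node, t in sigma:
        out[node, t] = value
    return out

def wlu(G, zeta, sigma):
    """Wonderful life utility for the component set sigma (Eq. 7): G(zeta) - G(CL_sigma(zeta))."""
    return G(zeta) - G(clamp(zeta, sigma))

def toy_G(zeta):
    """Assumed toy world utility: penalize deviation of the per-time total from 2."""
    return -((zeta.sum(axis=0) - 2.0) ** 2).sum()

# Three nodes, two time steps; rows are nodes, columns are times.
zeta = np.array([[1.0, 1.0],
                 [1.0, 0.0],
                 [1.0, 1.0]])

# Clamp sets covering all components of one node (a guess at that node's effect set).
wlu_node0 = wlu(toy_G, zeta, [(0, 0), (0, 1)])
wlu_node1 = wlu(toy_G, zeta, [(1, 0), (1, 1)])
```

Here `wlu_node1` is negative: the recorded worldline scores worse under `toy_G` than the clamped one in which node 1 "had never existed".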

An important example is effect set wonderful life utilities when the set of all nodes is
partitioned into 'subworlds' in such a way that all nodes in the same subworld $\omega$ share
substantially the same effect set. In such a situation, all nodes in the same subworld $\omega$
will have essentially the same personal utilities, exactly as they would if they used team
game utilities with a "world" given by $\omega$. When all such nodes have large intelligence
values, this sharing of the personal utility will mean that all nodes in the same subworld
are acting in a coordinated fashion, loosely speaking.

The importance of the WLU arises from the following results: 

Theorem 2: i) A system is factored at all $\zeta \in C$ iff for all those $\zeta$, $\forall \eta$, we can write

$$g_\eta(\zeta) = \Phi_\eta(\zeta_{\hat{C}^{\mathrm{eff}}_\eta}, G(\zeta)) \qquad (8)$$

for some function $\Phi_\eta(\cdot, \cdot)$ such that $\partial_G \Phi_\eta(\zeta_{\hat{C}^{\mathrm{eff}}_\eta}, G) > 0$ for all
$\zeta \in C$ and associated $G$ values. (The form of the $\{g_\eta\}$ off of $C$ is arbitrary.)

ii) In particular, a COIN is factored for personal utilities set equal to the associated
effect set wonderful life utilities.

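Proof: To prove (i), first write
$\zeta_{\hat{C}^{\mathrm{eff}}_\eta} = \zeta_{,t<0} \bullet \zeta_{\hat{\eta},0} \bullet (\zeta_{\hat{C}^{\mathrm{eff}}_\eta})_{,t>0}$.
For $\zeta \in C$, $(\zeta_{\hat{C}^{\mathrm{eff}}_\eta})_{,t>0}$ is independent of $\zeta_{\eta,0}$, and so by definition of
$C(\cdot)$ it is a single-valued function of $\zeta_{\hat{\eta},0}$ for such $\zeta$. Therefore
$\zeta_{\hat{C}^{\mathrm{eff}}_\eta} = \zeta_{,t<0} \bullet \zeta_{\hat{\eta},0} \bullet f(\zeta_{\hat{\eta},0})$ for some function $f(\cdot)$.
Accordingly, by Thm. 1, for $\{g_\eta\}$ of the form stipulated in (i), the system is factored. Going the
other way, if the system is factored, then by Thm. 1 it can be written as
$\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat{\eta},0}, G(\zeta))$. Since both $\zeta_{,t<0}$ and $\zeta_{\hat{\eta},0}$ are contained in
$\zeta_{\hat{C}^{\mathrm{eff}}_\eta}$, we can rewrite this as $\Phi_\eta([\zeta_{\hat{C}^{\mathrm{eff}}_\eta}]_{,t \leq 0}, G(\zeta))$.
QED.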

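Part (ii) of the theorem follows immediately from part (i). For pedagogical value though,
here we instead derive it directly. First, since $\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta)$ is independent of
$\zeta_{\eta',t}$ for all $(\eta', t) \in C^{\mathrm{eff}}_\eta$, so is the $Z$ vector
$\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0} \bullet C(\zeta_{,0}))$, i.e.,
$[\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0} \bullet C(\zeta_{,0}))]_{\eta',t} = \vec{0}$ $\forall (\eta', t) \in C^{\mathrm{eff}}_\eta$.
This means that viewed as a $\zeta_{,t<0}$-parameterized function from $C_{,0}$ to $Z$,
$\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0} \bullet C(\zeta_{,0}))$ is a single-valued function of the
$\zeta_{\hat{\eta},0}$ components. Therefore $G(\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0} \bullet C(\zeta_{,0})))$ can only
depend on $\zeta_{,t<0}$ and the non-$\eta$ components of $\zeta_{,0}$. Accordingly, the WLU for
$C^{\mathrm{eff}}_\eta$ is just $G$ minus a term that is a function of $\zeta_{,t<0}$ and $\zeta_{\hat{\eta},0}$. By
choosing $\Phi_\eta(\cdot, \cdot, \cdot)$ in Thm. 1 to be that difference, we see
that $\eta$'s effect set WLU is of the form necessary for the system to be factored. QED.

As a generalization of (ii), the system is factored if each node $\eta$'s personal utility is (a
monotonically increasing function of) the WLU for a set $\sigma_\eta$ that contains $C^{\mathrm{eff}}_\eta$.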

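For conciseness, except where explicitly needed, for the remainder of this subsection
we will suppress the argument "$\zeta_{,t<0}$", taking it to be implicit. The next result concerning
the practical importance of effect set WLU is the following:

Theorem 3: Let $\sigma$ be a set containing $C^{\mathrm{eff}}_\eta$. Then

$$\frac{\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\|\partial_{\zeta_{\hat{\eta},0}} G(C(\zeta_{,0}))\|}{\|\partial_{\zeta_{\hat{\eta},0}} G(C(\zeta_{,0})) - \partial_{\zeta_{\hat{\eta},0}} G(\mathrm{CL}_\sigma(C(\zeta_{,0})))\|} .$$

Proof: Writing it out,

$$\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta) = \frac{\|\partial_{\zeta_{\eta,0}} G(C(\zeta_{,0})) - \partial_{\zeta_{\eta,0}} G(\mathrm{CL}_\sigma(C(\zeta_{,0})))\|}{\|\partial_{\zeta_{\hat{\eta},0}} G(C(\zeta_{,0})) - \partial_{\zeta_{\hat{\eta},0}} G(\mathrm{CL}_\sigma(C(\zeta_{,0})))\|} .$$

The second term in the numerator equals 0, by definition of effect set. Dividing by the
similar expression for $\lambda_{\eta,G}(\zeta)$ then gives the result claimed. QED.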

So if we expect that ratio of magnitudes of gradients to be large, effect set WLU has
much higher learnability than team game utility, while still being factored, like team
game utility. As an example, consider the case where the COIN is a very large system,
with $\eta$ being only a relatively minor part of the system (e.g., a large human economy
with $\eta$ being a "typical John Doe living in Peoria, Illinois"). Often in such a system, for
the vast majority of nodes $\eta' \neq \eta$, how $G$ varies with $\zeta_{\eta,}$ will be essentially independent
of the value $\zeta_{\eta',0}$. (For example, how GDP of the US economy varies with the actions of
our John Doe from Peoria, Illinois will be independent of the state of some Jane Smith
living in Los Angeles, California.) In such circumstances, Thm. 3 tells us that the effect
set wonderful life utility for $\eta$ will have a far larger learnability than does the world
utility.
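The learnability gain claimed by Thm. 3 can be illustrated numerically. The sketch below rests on several assumptions that are not taken from the text: a single time step with identity dynamics (so node 0's effect set is just its own component), scalar states, a toy world utility `G` with a weak coupling between node 0 and the other nodes, and learnability computed as the ratio of the gradient magnitude with respect to the node's own state to that with respect to all other nodes' states, which is how the proof of Thm. 3 uses it. Gradients are estimated by central finite differences, and all function names are hypothetical.

```python
import numpy as np

def G(z):
    """Assumed toy world utility: separable terms plus a weak coupling to node 0."""
    return np.sin(z).sum() + 0.1 * z[0] * z[1:].sum()

def clamp(z):
    """Clamp node 0's component (its effect set in this toy system) to 0."""
    out = z.copy()
    out[0] = 0.0
    return out

def wlu(z):
    """Effect set wonderful life utility for node 0."""
    return G(z) - G(clamp(z))

def grad(f, z, h=1e-6):
    """Central finite-difference gradient of f at z."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = h
        g[i] = (f(z + e) - f(z - e)) / (2.0 * h)
    return g

def learnability(f, z):
    """|df/dz_0| divided by the norm of the gradient with respect to all other components."""
    g = grad(f, z)
    return abs(g[0]) / np.linalg.norm(g[1:])

z = np.full(50, 0.5)          # 50 nodes, all in the same state
lam_G = learnability(G, z)
lam_wlu = learnability(wlu, z)

# Right-hand side of Thm. 3, evaluated with the same finite differences.
g_G = grad(G, z)
g_clamped = grad(lambda v: G(clamp(v)), z)
thm3_rhs = np.linalg.norm(g_G[1:]) / np.linalg.norm(g_G[1:] - g_clamped[1:])
```

In this toy system the clamped term removes most of the dependence on the other nodes, so `lam_wlu` comes out well above `lam_G`, and the ratio of the two agrees with `thm3_rhs` up to finite-difference error.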

For any fixed $\sigma$, if we change the clamping operation (i.e., change the choice of the
"arbitrary fixed value" we clamp each component to), then we change the mapping
$\zeta_{,0} \rightarrow \mathrm{CL}_\sigma(C(\zeta_{,0}))$, and therefore change the mapping
$(\zeta_{\eta,0}, \zeta_{\hat{\eta},0}) \rightarrow G(\mathrm{CL}_\sigma(C(\zeta_{,0})))$.
Accordingly, changing the clamping operation can affect the value of
$\partial_{\zeta_{\hat{\eta},0}} G(\mathrm{CL}_\sigma(C(\zeta_{,0})))$ evaluated at some point $\zeta_{,0}$. Therefore, by
Thm. 3, changing the clamping operation can affect $\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta)$. So properly speaking,
for any choice of $\sigma$, if we are going to use $\mathrm{WLU}_\sigma$, we should set the clamping operation so as
to maximize learnability. For simplicity though, in this paper we will ignore this phenomenon, and
simply set the clamping operation to the more or less "natural" choice of $\vec{0}$, as mentioned above.

Next consider the case where, for some node $\eta$, we can write $G(\zeta)$ as
$G_1(\zeta_{C^{\mathrm{eff}}_\eta}) + G_2(\zeta_{,t<0} \bullet \zeta_{\hat{C}^{\mathrm{eff}}_\eta})$. Say it is also true that
$\eta$'s effect set is a small fraction of the set of all components. In this case it is often true that
the values of $G(\cdot)$ are much larger than those of $G_1(\cdot)$, which means that partial derivatives
of $G(\cdot)$ are much larger than those of $G_1(\cdot)$. In such situations the effect set WLU is far more
learnable than the world utility, due to the following results:

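Theorem 4: If for some node $\eta$ there is a set $\sigma$ containing $C^{\mathrm{eff}}_\eta$, a function
$G_1(\zeta_\sigma \in Z_\sigma)$, and a function $G_2(\zeta_{\hat{\sigma}} \in Z_{\hat{\sigma}})$, such that
$G(\zeta) = G_1(\zeta_\sigma) + G_2(\zeta_{\hat{\sigma}})$, then

$$\frac{\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\|\partial_{\zeta_{\hat{\eta},0}} G(C(\zeta_{,0}))\|}{\|\partial_{\zeta_{\hat{\eta},0}} G(\mathrm{CL}_{\hat{\sigma}}(C(\zeta_{,0})))\|} .$$

Proof: For brevity, write $G_1$ and $G_2$ both as functions of full $\zeta \in Z$, just such
functions that are only allowed to depend on the components of $\zeta$ that lie in $\sigma$ and those
components that do not lie in $\sigma$, respectively. Then the $\sigma$ WLU for node $\eta$ is just
$g_\eta(\zeta) = G_1(\zeta) - G_1(\mathrm{CL}_\sigma(\zeta))$. Since in that second term we are clamping all the
components of $\zeta$ that $G_1(\cdot)$ cares about, for this personal utility
$\partial_{\zeta_{,0}} g_\eta(C(\zeta_{,0})) = \partial_{\zeta_{,0}} G_1(C(\zeta_{,0}))$. So in particular
$\partial_{\zeta_{\hat{\eta},0}} g_\eta(C(\zeta_{,0})) = \partial_{\zeta_{\hat{\eta},0}} G_1(C(\zeta_{,0})) = \partial_{\zeta_{\hat{\eta},0}} G(\mathrm{CL}_{\hat{\sigma}}(C(\zeta_{,0})))$.
Now by definition of effect set, $\partial_{\zeta_{\eta,0}} G_2(\zeta_{,t<0} \bullet C(\zeta_{,0})) = \vec{0}$, since
$\hat{\sigma}$ does not contain any element of $C^{\mathrm{eff}}_\eta$. So
$\partial_{\zeta_{\eta,0}} G(C(\zeta_{,0})) = \partial_{\zeta_{\eta,0}} G_1(C(\zeta_{,0})) = \partial_{\zeta_{\eta,0}} g_\eta(C(\zeta_{,0}))$. QED.

The obvious extensions of Thm.'s 3 and 4 to effect sets with respect to times other than 0
can also be proven [283].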

An important special case of Thm. 4 is the following: 

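Corollary 1: If for some node $\eta$ we can write

i) $G(\zeta) = G_1(\zeta_\sigma) + G_2([\zeta_{\hat{\sigma}}]_{,t \geq 0}) + G_3(\zeta_{,t<0})$
for some set $\sigma$ containing $C^{\mathrm{eff}}_\eta$, and if

ii) $\|\partial_{\zeta_{\hat{\eta},0}} G(C(\zeta_{,0}))\| \gg \|\partial_{\zeta_{\hat{\eta},0}} G_1([C(\zeta_{,0})]_\sigma)\|$,
then

$$\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta) \gg \lambda_{\eta,G}(\zeta) .$$

In practice, to assure that condition (i) of this corollary is met might require that $\sigma$
be a proper superset of $C^{\mathrm{eff}}_\eta$. Countervailingly, to assure that condition (ii) is met will
usually force us to keep $\sigma$ as small as possible.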

One can often remove elements from an effect set and still have the results of this
section hold. Most obviously, if $(\eta', t) \in C^{\mathrm{eff}}_\eta$ but $\partial_{\zeta_{\eta',t}} G(\zeta) = \vec{0}$, we can
remove $(\eta', t)$ from $C^{\mathrm{eff}}_\eta$ without invalidating our results. More generally, if there is a set
$\sigma' \subset C^{\mathrm{eff}}_\eta$ such that for each component $(\eta, 0; i)$ the chain rule term
$\sum_{(\eta',t) \in \sigma'} [\partial_{\zeta_{\eta',t}} G(\zeta)] \cdot [\partial_{\zeta_{\eta,0;i}} [C(\zeta_{,0})]_{\eta',t}] = 0$,
then the effects on $G$ of changes to $\zeta_{\eta,0}$ that are "mediated" by
the members of $\sigma'$ cancel each other out. In this case we can usually remove the elements
of $\sigma'$ from $C^{\mathrm{eff}}_\eta$ with no ill effects.

4.3.4 Inducing our salient characteristic 

Usually the mathematics of a descriptive framework — a formal investigation of the 
salient characteristics — will not provide theorems of the sort, "If you modify the COIN 
the following way at time t, the value of the world utility will increase." Rather it 
provides theorems that relate a COIN's salient characteristics with the general properties 
of the COIN's entire history, and in particular with those properties embodied in $C$.
In particular, the salient characteristic that we are concerned with in this chapter is
that the system be highly intelligent for personal utilities for which it is factored, and
our mathematics concerns the relationship between factoredness, intelligence, personal
utilities, effect sets, and the like. 

More formally, the desideratum associated with our salient characteristic is that we
want the COIN to be at a $\zeta$ for which there is some set of $\{g_\eta\}$ (not necessarily consisting
of private utilities) such that (a) $\zeta$ is factored for the $\{g_\eta\}$, and (b) $\epsilon_{\eta,g_\eta}(\zeta)$ is
large for all $\eta$. Now there are several ways one might try to induce the COIN to be at such a
point. One approach is to have each algorithm controlling $\eta$ explicitly try to "steer" the
worldline towards such a point. In this approach $\eta$ needn't even have a private utility
in the usual sense. (The overt "goal" of the algorithm controlling $\eta$ involves finding
a $\zeta$ with a good associated extremum over the class of all possible $g_\eta$, independent of
any private utilities.) Now initialization of the COIN, i.e., fixing of $\zeta_{,0}$, involves setting
the algorithm controlling $\eta$, in this case to the steering algorithm. Accordingly, in this
approach to initialization, we fix $\zeta_{,0}$ to a point for which there is some special $g_\eta$ such
that both $C(\zeta_{,0})$ is factored for $g_\eta$, and $\epsilon_{\eta,g_\eta}(C(\zeta_{,0}))$ is large. There is
nothing peculiar about this. What is odd though is that in this approach we do not know what that
"special" $g_\eta$ is when we do that initialization; it's to be determined, by the unfolding of
the system.

In this chapter we concentrate on a different approach, which can involve either
initialization or macrolearning. In this alternative we deploy the $\{g_\eta\}$ as the microlearners'
private utilities at some $t < 0$, in a process not captured in $C$, so as to induce a factored
COIN that is as intelligent as possible. (It is with that "deploying of the $\{g_\eta\}$" that we
are trying to induce our salient characteristic in the COIN.) Since in this approach we
are using private utilities, we can replace intelligence with its surrogate, learnability. So
our task is to choose $\{g_\eta\}$ which are as learnable as possible while still being factored.

Solving for such utilities can be expressed as solving a set of coupled partial differential 
equations. Those equations involve the tangent plane to the manifold C, a functional 
trading off (the differential versions of) degree of factoredness and learnability, and any 
communication constraints on the nodes we must respect. While there is not space in the 
current chapter to present those equations, we can note that they are highly dependent on 
the correlations among the components of $\zeta$. So in this approach, in COIN initialization
we use some preliminary guesses as to those correlations to set the initial $\{g_\eta\}$. For
example, the effect set of a node $\eta$ constitutes all components $\zeta_{\eta',t}$ that have non-zero
correlation with $\zeta_{\eta,0}$. Furthermore, by Thm. 2 the system is factored for effect set WLU
personal utilities. And by Coroll. 1, for small effect sets, the effect set WLU has much
greater differential utility learnability than does $G$. Extending the reasoning behind this
result to all $\zeta$ (or at least all likely $\zeta$), we see that for this scenario, the descriptive
framework advises us to use Wonderful Life private utilities based on (guesses for) the
associated effect sets rather than the team game private utilities, $g_\eta = G$ $\forall \eta$.

In macrolearning we must instead run-time estimate an approximate solution to our
partial differential equations, based on statistical inference.^^ As an example, we might
start with an initial guess as to $\eta$'s effect set, and set its private utility to the associated
WLU. But then as we watch the system run and observe the correlations among the
components of $\zeta$, we might modify which components we think comprise $\eta$'s effect set,
and modify $\eta$'s personal utility accordingly.
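The idea of revising a guessed effect set from observed correlations can be sketched as follows. This is only an illustration under stated assumptions, not the inference procedure used in the experiments: it assumes that many independent runs have been logged into an array `samples` (one row per run, one column per component), that column `eta_index` holds the node's time-0 state, and that a plain correlation threshold is an acceptable test. The function name `guessed_effect_set`, the `threshold` value, and the synthetic data are all hypothetical.

```python
import numpy as np

def guessed_effect_set(samples, eta_index, threshold=0.2):
    """Indices of components whose sample correlation with column `eta_index`
    exceeds `threshold` in magnitude (the column itself is always included)."""
    corr = np.corrcoef(samples, rowvar=False)[eta_index]
    return {i for i, r in enumerate(corr) if abs(r) > threshold}

# Synthetic logged runs (an assumption for illustration only).
rng = np.random.default_rng(0)
runs = 2000
x0 = rng.normal(size=runs)                      # node eta's state at time 0
samples = np.column_stack([
    x0,
    2.0 * x0 + 0.1 * rng.normal(size=runs),     # strongly driven by x0
    rng.normal(size=runs),                      # unrelated to x0
    -x0 + rng.normal(size=runs),                # noisily driven by x0
])

effect_set = guessed_effect_set(samples, eta_index=0)
```

A macrolearner built along these lines would periodically recompute `effect_set` as more runs accumulate and reset the node's WLU to clamp the new set. Correlation is only a proxy for the derivative-based definition of the effect set, so the threshold trades off missed components against spurious ones.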

4.4 Illustrative Simulations of our Descriptive Framework 

As implied above, often one can perform reasonable COIN initialization and/or macrolearn- 
ing without writing down the partial differential equations governing our salient char- 
acteristic explicitly. Simply "hacking" one's way to the goal of maximizing both degree 
of factoredness and intelligibility, for example by estimating effect sets, often results in 
dramatic improvement in performance. This is illustrated in the experiments recounted 
in the next two subsections. 



4.4.1 COIN Initialization 

Even if we don't exactly know the effect set of each node $\eta$, often we will be able to make
a reasonable guess about which components of $\zeta$ comprise the "preponderance" of $\eta$'s
effect set. We call such a set a guessed effect set. As an example, often the primary
effects of changes to $\eta$'s state will be on the future state of $\eta$, with only relatively minor
effects on the future states of other nodes. In such situations, we would expect to still
get good results if we approximated the effect set WLU of each node $\eta$ with a WLU
based on the guessed effect set $\zeta_{\eta,t \geq 0}$. In other words, we would expect to be able to
replace $\mathrm{WLU}_{C^{\mathrm{eff}}_\eta}$ with $\mathrm{WLU}_{\zeta_{\eta,t \geq 0}}$ and still get good performance.



This phenomenon was borne out in the experiments recounted in [285] that used
COIN initialization for distributed control of network packet routing. In a conventional
approach to packet routing, each router runs what it believes (based on the information
available to it) to be a shortest path algorithm (SPA), i.e., each router sends its packets in
the way that it surmises will get those packets to their destinations most quickly. Unlike
with an approach based on our COIN framework, with SPA-based routing the routers
have no concern for the possible deleterious side-effects of their routing decisions on the
global performance (e.g., they have no concern for whether they induce bottlenecks).
We performed simulations in which we compared such a COIN-based routing system
to an SPA-based system. For the COIN-based system $G$ was global throughput and
no macrolearning was used. The COIN initialization was to have each router's private
utility be a WLU based on an associated guessed effect set generated a priori. In addition,
the COIN-based system was realistic in that each router's reinforcement algorithm had
imperfect knowledge of the state of the system. On the other hand, the SPA was an
idealized "best-possible" system, in which each router knew exactly what the shortest
paths were at any given time. Despite the handicap that this disparity imposed on the
COIN-based system, it achieved significantly better global throughput in our experiments
than did the perfect-knowledge SPA-based system, and in particular, avoided the Braess'
Paradox that was built-in to some of those systems [265].

^^Recall that in the physical world, it is often useful to employ devices using algorithms that are
based on probabilistic concepts, even though the underlying system is ultimately deterministic. (Indeed,
theological Bayesians invoke a "degree of belief" interpretation of probability to demand such an approach;
see [282] for a discussion of the legitimacy of this viewpoint.) Similarly, although we take the
underlying system in a COIN to be deterministic, it is often useful to use microlearners or (as here)
macrolearners that are based on probabilistic concepts.

The experiments in [288] were primarily concerned with the application of packet-
routing. To concentrate more precisely on the issue of COIN initialization, we ran 
subsequent experiments on variants of Arthur's famous "El Farol bar problem" (see
Section ^). To facilitate the analysis we modified Arthur's original problem to be more 
general, and since we were not interested in directly comparing our results to those in 
the literature, we used a more conventional (and arguably "dumber") machine learning 
algorithm than the ones investigated in M, 47, ISO, EM. 



In this formulation of the bar problem [289], there are $N$ agents, each of whom picks
one of seven nights to attend a bar the following week, a process that is then repeated.
In each week, each agent's pick is determined by its predictions of the associated rewards
it would receive. These predictions in turn are based solely upon the rewards received
by the agent in preceding weeks. An agent's "pick" at week $t$ (i.e., its node's state at
that week) is represented as a unary seven-dimensional vector. (See the discussion in the
definitions subsection of our representing discrete variables as Euclidean variables.) So
$\eta$'s zeroing its state in some week, as in the CL operation, essentially means it elects
not to attend any night that week.

The world utility is

$$G(\zeta) = \sum_t R(\zeta_{,t}),$$

where $R(\zeta_{,t}) = \sum_{k=1}^{7} \gamma_k(x_k(\zeta_{,t}))$; $x_k(\zeta_{,t})$ is the total attendance on night $k$ at week $t$;
$\gamma_k(y) = \alpha_k y \exp(-y/c)$; and $c$ and each of the $\{\alpha_k\}$ are real-valued parameters.
Intuitively, the "world reward" R is the sum of the global "rewards" for each night in 
each week. It reflects the effects in the bar as the attendance profile of agents changes. 
When there are too few agents attending some night, the bar suffers from lack of activity 
and therefore the global reward for that night is low. Conversely, when there are too 
many agents the bar is overcrowded and the reward for that night is again low. Note 
that $\gamma_k(\cdot)$ reaches its maximum when its argument equals $c$.

In these experiments we investigate two different $\alpha$'s. One treats all nights equally:
$\alpha = [1\ 1\ 1\ 1\ 1\ 1\ 1]$. The other is only concerned with one night: $\alpha = [0\ 0\ 0\ 7\ 0\ 0\ 0]$. In
our experiments, $c = 6$ and $N$ is chosen to be 4 times larger than the number of agents
necessary to have $c$ agents attend the bar on each of the seven nights, i.e., there are
$4 \cdot 6 \cdot 7 = 168$ agents (this ensures that there are no trivial solutions and that for the
world utility to be maximized, the agents have to "cooperate").
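The following minimal Python sketch illustrates the world reward for a single week under the stated parameters ($c = 6$, uniform $\alpha$, 168 agents). The function names `gamma` and `world_reward` and the two attendance profiles are illustrative choices, not part of the original experiments; the second profile is the optimal one described later in the text (6 agents on each of 6 nights, the remaining 132 on one night).

```python
import math

C = 6.0       # the parameter c used in the experiments
NIGHTS = 7


def gamma(y, alpha_k=1.0, c=C):
    """Per-night reward gamma_k(y) = alpha_k * y * exp(-y / c)."""
    return alpha_k * y * math.exp(-y / c)


def world_reward(attendance, alpha, c=C):
    """World reward R for one week: sum of gamma_k over the seven nights."""
    return sum(gamma(x, a, c) for x, a in zip(attendance, alpha))


n_agents = 4 * int(C) * NIGHTS            # 4 * 6 * 7 agents
uniform_alpha = [1.0] * NIGHTS

# gamma peaks when attendance equals c (scan over integer attendances).
best_attendance = max(range(n_agents + 1), key=gamma)

# Two illustrative attendance profiles for the same population.
even_split = [n_agents // NIGHTS] * NIGHTS     # 24 agents every night
crowd_one = [6] * 6 + [n_agents - 6 * 6]       # 6 on six nights, 132 on one

r_even = world_reward(even_split, uniform_alpha)
r_crowd = world_reward(crowd_one, uniform_alpha)
print(best_attendance, round(r_even, 3), round(r_crowd, 3))
```

Under these assumptions, spreading the agents evenly overcrowds every night, whereas sacrificing one night keeps the other six at the optimum attendance, which is why the agents cannot all act alike if the world reward is to be high.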






As explained below, our microlearning algorithms worked by providing a real-valued 
"reward" signal to each agent at each week t. Each agent's reward function is a surrogate 
for an associated utility function for that agent. The difference between the two functions 
is that the reward function only reflects the state of the system at one moment in time 
(and therefore is potentially observable), whereas the utility function reflects the agent's 
ultimate goal, and therefore can depend on the full history of that agent across time. 

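We investigated three agent reward functions. One was based on effect set WLU. The
other two were "natural" rewards included for comparison purposes. With $d_\eta$ the night
selected by $\eta$, the three rewards are:

Uniform Division (UD): $r_\eta(\zeta_{,t}) \equiv \gamma_{d_\eta}(x_{d_\eta}(\zeta_{,t})) / x_{d_\eta}(\zeta_{,t})$

Global (G): $r_\eta(\zeta_{,t}) \equiv R(\zeta_{,t}) = \sum_{k=1}^{7} \gamma_k(x_k(\zeta_{,t}))$

Wonderful Life (WL): $r_\eta(\zeta_{,t}) \equiv R(\zeta_{,t}) - R(\mathrm{CL}_{\eta}(\zeta_{,t})) = \gamma_{d_\eta}(x_{d_\eta}(\zeta_{,t})) - \gamma_{d_\eta}(x_{d_\eta}(\mathrm{CL}_{\eta}(\zeta_{,t})))$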
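The three definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions: agents are indexed by position in a list `picks` of chosen nights (0 to 6), clamping an agent to the zero vector is modeled as removing it from its night's attendance count, and all function names are hypothetical.

```python
import math


def gamma(y, alpha_k, c=6.0):
    """gamma_k(y) = alpha_k * y * exp(-y / c)."""
    return alpha_k * y * math.exp(-y / c)


def attendance(picks, nights=7):
    """x_k: number of agents whose pick is night k."""
    x = [0] * nights
    for night in picks:
        x[night] += 1
    return x


def world_reward(x, alpha, c=6.0):
    """R for one week, given the attendance vector x."""
    return sum(gamma(xk, ak, c) for xk, ak in zip(x, alpha))


def ud_reward(picks, agent, alpha, c=6.0):
    """Uniform Division: the night's reward split evenly among attendees."""
    x = attendance(picks)
    d = picks[agent]
    return gamma(x[d], alpha[d], c) / x[d]


def g_reward(picks, agent, alpha, c=6.0):
    """Global: every agent receives the full world reward."""
    return world_reward(attendance(picks), alpha, c)


def wl_reward(picks, agent, alpha, c=6.0):
    """Wonderful Life: R minus R with this agent clamped to zero.

    Only the agent's own night changes under clamping, so the other six
    nights cancel and only local attendance is needed.
    """
    x = attendance(picks)
    d = picks[agent]
    return gamma(x[d], alpha[d], c) - gamma(x[d] - 1, alpha[d], c)


picks = [0, 0, 0, 1, 1, 2]        # six agents, hypothetical picks
alpha = [1.0] * 7
print([round(wl_reward(picks, a, alpha), 4) for a in range(len(picks))])
```

Note that `wl_reward` uses only the attendance on the agent's own night, while `g_reward` needs all seven attendance counts.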



The conventional UD reward is a natural "naive" choice for the agents' reward; the 
total reward on each night gets uniformly divided among the agents attending that night. 
If we take $g_\eta(\zeta) = \sum_t r_\eta(\zeta_{,t})$ (i.e., $\eta$'s utility is an undiscounted sum of its rewards), then
for the UD reward $G(\zeta) = \sum_\eta g_\eta(\zeta)$, so that the system is weakly trivial. The original
version of the bar problem in the physics literature [^] is the special case where UD 
reward is used but there are only two "nights" in the week (one of which corresponds
to "staying at home"); $\alpha$ is uniform; and $\gamma_k(x_k) = x_k \Theta(c_k N - x_k)$ for some vector $c$,
taken to equal (.6, .4) in the very original papers. So the reward to agent $\eta$ is 1 if it
attends the bar and attendance is below capacity, or if it stays at home and the bar is
over capacity. Reward is 0 otherwise. (In addition, unlike in our COIN-based systems, in
the original work on the bar problem the microlearners work by explicitly predicting the 
bar attendance, rather than by directly modifying behavior to try to increase a reward 
signal.) 
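As a small illustration of the two-night special case and of weak triviality, the sketch below computes the UD rewards for $\gamma_k(x_k) = x_k \Theta(c_k N - x_k)$ with $c = (.6, .4)$. The function names, the encoding of choices (0 for attending the bar, 1 for staying home), and the convention $\Theta(0) = 0$ are assumptions made for this example only.

```python
def heaviside(z):
    """Theta(z), taking Theta(0) = 0 (a convention assumed here)."""
    return 1 if z > 0 else 0


def original_bar(choices, capacity=(0.6, 0.4)):
    """UD rewards for the two-'night' bar problem.

    choices[i] is 0 if agent i attends the bar and 1 if it stays home.
    Returns (per-agent rewards, world reward for the week).
    """
    n = len(choices)
    x = [choices.count(k) for k in (0, 1)]
    # Theta(c_k * N - x_k): 1 while option k is under its capacity.
    under = [heaviside(capacity[k] * n - x[k]) for k in (0, 1)]
    world = sum(x[k] * under[k] for k in (0, 1))
    # UD reward gamma_k / x_k reduces to Theta, so each agent gets 0 or 1.
    rewards = [under[d] for d in choices]
    return rewards, world


rewards, world = original_bar([0] * 5 + [1] * 5)   # bar under capacity
print(rewards, world)
```

In both the under-capacity and over-capacity cases the agents' rewards sum to the world reward, which is the weak triviality property noted above.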

In contrast to the UD reward, providing the G reward at time t to each agent results in 
all agents receiving the same reward. This is the team game reward function, investigated 
for example in [^]. For this reward function, the system is automatically factored 
if we define $g_\eta(\zeta) \equiv \sum_t r_\eta(\zeta_{,t})$. However, evaluation of this reward function requires
centralized communication concerning all seven nights. Furthermore, given that there 
are 168 agents, G is likely to have poor learnability as a reward for any individual agent. 

This latter problem is obviated by using the WL reward, where the subtraction of 
the clamped term removes some of the "noise" of the activity of all other agents, leaving 
only the underlying "signal" of how the agent in question affects the utility. So one 
would expect that with the WL reward the agents can readily discern the effects of their 
actions on their rewards. Even though the conditions in Coroll. 1 don't hold (see the footnote), this
reasoning accords with the implicit advice of Coroll. 1 under the approximation of the
$t = 0$ effect set as $C^{\mathrm{eff}}_{\eta,t=0} \approx \zeta_{\eta,t \geq 0}$. In other words, it agrees with that corollary's implicit
advice under the identification of $\zeta_{\eta,t \geq 0}$ as $\eta$'s $t = 0$ guessed effect set.

[Footnote] The $t = 0$ elements of $C^{\mathrm{eff}}_{\eta,t=0}$ are just $\zeta_{\eta,t=0}$, but the contributions of $\zeta_{,t=0}$ to $G$ cannot be written as

In fact, in this very simple system, we can explicitly calculate the ratio of the WL
reward's learnability to that of the G reward, by recasting the system as existing for
only a single instant so that $C^{\mathrm{eff}}_{\eta,0} = \zeta_{\eta,0}$ exactly, and then applying Thm. 3. So for
example, say that all $\alpha_k = 1$, and that the number of nodes $N$ is evenly divided among
the seven nights. The numerator term in Thm. 3 is a vector whose components are
some of the partials of $G$ evaluated when $x_k(\zeta_{,0}) = N/7$. This vector is $7(N - 1)$
dimensional, one dimension for each of the 7 components of (the unary vector comprising)
each node other than $\eta$. For any particular $\eta' \neq \eta$ and night $i$, the associated partial derivative is
$\sum_k \left[ e^{-x_k(\zeta_{,0})/c} \left(1 - x_k(\zeta_{,0})/c\right) \cdot \partial_{\zeta_{\eta',0;i}} x_k(\zeta_{,0}) \right]$, where as usual "$\zeta_{\eta',0;i}$" indicates the $i$'th
component of the unary vector $\zeta_{\eta',0}$. Since $\partial_{\zeta_{\eta',0;i}} x_k(\zeta_{,0}) = \delta_{i,k}$, for any fixed $i$ and
$\eta'$, this sum just equals $e^{-N/7c} (1 - N/7c)$. Since there are $7(N - 1)$ such terms, after
taking the norm we obtain $\left| e^{-N/7c} \left[1 - N/7c\right] \right| \sqrt{7(N - 1)}$.

The denominator term in Thm. 3 is the difference between the gradients of the
global reward and the clamped reward. These differ on only $N - 1$ terms, one term
for that component of each node $\eta' \neq \eta$ corresponding to the night $\eta$ attends. (The
other $6N - 6$ terms are identical in the two partials and therefore cancel.) This yields
$\left| e^{-N/7c} \left[ (1 - N/7c) - e^{1/c} \left(1 - (N - 7)/7c\right) \right] \right| \sqrt{N - 1}$. Combining with the result of the
previous paragraph, our ratio is $\sqrt{7} \left| \frac{N - 7c}{(N - 7c)(1 - e^{1/c}) + 7 e^{1/c}} \right|$.

In addition to this learnability advantage of the WL reward, to evaluate its WL 
reward each agent only needs to know the total attendance on the night it attended, 
so no centralized communication is required. Finally, although the system won't be 
perfectly factored for this reward (since in fact the effect set of $\eta$'s action at $t$ would be
expected to extend a bit beyond $\zeta_{\eta,t}$), one might expect that it is close enough to being
factored to result in large world utility. 

Each agent keeps a seven dimensional Euclidean vector representing its estimate of 
the reward for attending each night of the week. At the end of each week, the component 
of this vector corresponding to the night just attended is proportionally adjusted towards 
the actual reward just received. At the beginning of the succeeding week, the agent picks 
the night to attend using a Boltzmann distribution with energies given by the components 
of the vector of estimated rewards, where the temperature in the Boltzmann distribution 
decays in time. (This learning algorithm is equivalent to Claus and Boutilier's [57]
independent learner algorithm for multi-agent reinforcement learning.) We used the 
same parameters (learning rate, Boltzmann temperature, decay rates, etc.) for all three 
reward functions. (This is an extremely primitive RL algorithm which we only chose 
for its pedagogical value; more sophisticated RL algorithms are crucial for eliciting high 
intelligence levels when one is confronted with more complicated learning problems.) 
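The microlearner described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class name `BoltzmannAgent`, the parameter values (learning rate 0.1, initial temperature 1.0, decay 0.99, temperature floor 0.001), and the sign convention that a higher estimated reward is more probable are all assumptions, since the text does not give them.

```python
import math
import random


class BoltzmannAgent:
    """Keeps one reward estimate per night and samples nights from a
    Boltzmann distribution whose temperature decays over time."""

    def __init__(self, nights=7, learning_rate=0.1, temperature=1.0,
                 decay=0.99, min_temperature=1e-3, rng=None):
        self.estimates = [0.0] * nights
        self.learning_rate = learning_rate
        self.temperature = temperature
        self.decay = decay
        self.min_temperature = min_temperature   # floor avoids division by zero
        self.rng = rng or random.Random()

    def probabilities(self):
        """Boltzmann probabilities over nights (max subtracted for stability)."""
        top = max(self.estimates)
        weights = [math.exp((e - top) / self.temperature) for e in self.estimates]
        total = sum(weights)
        return [w / total for w in weights]

    def pick(self):
        """Sample the night to attend at the start of the week."""
        nights = range(len(self.estimates))
        return self.rng.choices(nights, weights=self.probabilities())[0]

    def update(self, night, reward):
        """At the end of the week, move the attended night's estimate
        proportionally toward the reward received, then cool down."""
        self.estimates[night] += self.learning_rate * (reward - self.estimates[night])
        self.temperature = max(self.temperature * self.decay, self.min_temperature)


agent = BoltzmannAgent(rng=random.Random(0))
night = agent.pick()
agent.update(night, reward=1.0)    # reward would come from UD, G, or WL
```

The same agent can be driven by any of the three reward signals, which matches the statement that identical learning parameters were used for all three.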

[Footnote, continued] a sum of a $\zeta_{\eta,t=0}$ contribution and a $\zeta_{\hat{\eta},t=0}$ contribution.






[Figure 1: two panels plotting average performance against weeks.]

Figure 1: Average world reward when $\alpha = [0\ 0\ 0\ 7\ 0\ 0\ 0]$ (left) and when
$\alpha = [1\ 1\ 1\ 1\ 1\ 1\ 1]$ (right). In both plots the top curve is WL, middle is G, and
bottom is UD.



Figure 1 presents world reward values as a function of time, averaged over 50 separate
runs, for all three reward functions, for both $\alpha = [1\ 1\ 1\ 1\ 1\ 1\ 1]$ and $\alpha = [0\ 0\ 0\ 7\ 0\ 0\ 0]$.
The behavior with the G reward eventually converges to the global optimum. This is
in agreement with the results obtained by Crites [30] for the bank of elevators control
problem. Systems using the WL reward also converged to optimal performance. This
indicates that for the bar problem our approximations of effect sets are sufficiently
accurate, i.e., that ignoring the effects one agent's actions will have on future actions of
other agents does not significantly diminish performance. This reflects the fact that the
only interactions between agents occur indirectly, via their affecting each other's reward
values.

However, since the WL reward is more learnable than the G reward, convergence
with the WL reward should be far quicker than with the G reward. Indeed, when
$\alpha = [0\ 0\ 0\ 7\ 0\ 0\ 0]$, systems using the G reward converge in 1250 weeks, which is 5
times longer than for the systems using the WL reward. When $\alpha = [1\ 1\ 1\ 1\ 1\ 1\ 1]$, systems take
6500 weeks to converge with the G reward, which is more than 30 times longer than the
time with the WL reward.

In contrast to the behavior for reward functions based on our COIN framework, use
of the conventional UD reward results in very poor world reward values, values that
deteriorated as the learning progressed. This is an instance of the TOC. For example,
for the case where $\alpha = [0\ 0\ 0\ 7\ 0\ 0\ 0]$, it is in every agent's interest to attend the same
night, but their doing so shrinks the world reward "pie" that must be divided among
all agents. A similar TOC occurs when $\alpha$ is uniform. This is illustrated in Fig. 2, which
shows a typical example of daily attendance figures ($\{x_k(\zeta_{,t})\}$) for each of the three
reward functions for $t = 2000$. In this example optimal performance (achieved with the
WL reward) has 6 agents each on 6 separate nights (thus maximizing the reward on 6
nights), and the remaining 132 agents on one night.

Figure 2: Typical daily attendance when $\alpha = [1\ 1\ 1\ 1\ 1\ 1\ 1]$ for WL (left), G (center),
and UD (right).

Figure 3: Behavior of each reward function with respect to the number of agents for
$\alpha = [0\ 0\ 0\ 7\ 0\ 0\ 0]$.

Figure 3 shows how $t = 2000$ performance scales with $N$ for each of the reward signals
for $\alpha = [0\ 0\ 0\ 7\ 0\ 0\ 0]$. Systems using the UD reward perform poorly regardless of $N$.
Systems using the G reward perform well when $N$ is low. As $N$ increases, however, it
becomes increasingly difficult for the agents to extract the information they need from the
G reward. (This problem is significantly worse for uniform $\alpha$.) Because of their superior
learnability, systems using the WL reward overcome this signal-to-noise problem (i.e.,
because the WL reward is based on the difference between the actual state and the state
where one agent is clamped, it is much less affected by the total number of agents).



4.5 Macrolearning 

In the experiments recounted above, the agents were sufficiently independent that assum- 
ing they did not affect each other's actions (when forming guesses for effect sets) allowed 
the resultant WL reward signals to result in optimal performance. In this section we 
investigate the contrasting situation where we have initial guesses of effect sets that are 
quite poor and that therefore result in bad global performance when used with WL re- 
wards. In particular, we investigate the use of macrolearning to correct those guessed 
effect sets at run-time, so that with the corrected guessed effect sets WL rewards will 
instead give optimal performance. This models real-world scenarios where the system 
designer's initial guessed effect sets are poor approximations of the actual associated
effect sets and need to be corrected adaptively. 
In these experiments the bar problem is significantly modified to incorporate constraints
designed to result in poor $G$ when the WL reward is used with certain initial
guessed effect sets. To do this we forced the nights actually attended by some of the
agents (followers) to agree with those attended by other agents (leaders), regardless
of what night those followers "picked" via their microlearning algorithms. (For leaders,
picked and actually attended nights were always the same.) We then had the world utility
be the sum, over all leaders, of the values of a triply-indexed reward matrix whose indices
are the nights that each leader-follower set attends: $G(\zeta) = \sum_t \sum_i R_{l_i(t), f_{1,i}(t), f_{2,i}(t)}$,
where $l_i(t)$ is the night the $i$'th leader attends in week $t$, and $f_{1,i}(t)$ and $f_{2,i}(t)$ are the
nights attended by the followers of leader $i$ in week $t$ (in this study, each leader has two
followers). We also had the states of each node be one of the integers $\{0, 1, \ldots, 6\}$ rather
than (as in the bar problem) a unary seven-dimensional vector. This was a bit of a
contrivance, since constructions like $\partial_{\zeta_{\eta,0}}$ aren't meaningful for such essentially symbolic
interpretations of the possible states $\zeta_{\eta,0}$. As elaborated below, though, it was helpful
for constructing a scenario in which guessed effect set WLU results in poor performance,
i.e., a scenario in which we can explore the application of macrolearning.

To see how this setup can result in poor world utility, first note that the system's
dynamics is what restricts all the members of each triple $(l_i(t), f_{1,i}(t), f_{2,i}(t))$ to equal
the night picked by leader $i$ for week $t$. So $f_{1,i}(t)$ and $f_{2,i}(t)$ are both in leader $i$'s actual
effect set at week $t$, whereas the initial guess for $i$'s effect set may or may not contain
nodes other than $l_i(t)$. (For example, in the bar problem experiments, the guessed effect
set does not contain any nodes beyond $l_i(t)$.) On the other hand, $G$ and $R$ are defined for
all possible triples $(l_i(t), f_{1,i}(t), f_{2,i}(t))$. So in particular, $R$ is defined for the dynamically
unrealizable triples that can arise in the clamping operation. This fact, combined with
the leader-follower dynamics, means that for certain $R$'s there exist guessed effect sets
such that the dynamics assures poor world utility when the associated WL rewards are
used. This is precisely the type of problem that macrolearning is designed to correct.

As an example, say each week only contains two nights, 0 and 1. Set $R_{111} = 1$ and
$R_{000} = 0$. So the contribution to $G$ when a leader picks night 1 is 1, and when that
leader picks night 0 it is 0, independent of the picks of that leader's followers (since the
actual nights they attend are determined by their leader's picks). Accordingly, we want
to have a private utility for each leader that will induce that leader to pick night 1. Now
if a leader's guessed effect set includes both of its followers (in addition to the leader
itself), then clamping all elements in its effect set to 0 results in an $R$ value of $R_{000} = 0$.
Therefore the associated guessed effect set WLU will reward the leader for choosing night
1, which is what we want. (For this case WL reward equals $R_{111} - R_{000} = 1$ if the leader
picks night 1, compared to reward $R_{000} - R_{000} = 0$ for picking night 0.)

However, consider having two leaders, $i_1$ and $i_2$, where $i_1$'s guessed effect set consists of
$i_1$ itself together with the two followers of $i_2$ (rather than together with the two followers
of $i_1$ itself). So neither of leader $i_1$'s followers is in its guessed effect set, while $i_1$ itself is.
Accordingly, the three indices to $i_1$'s $R$ need not have the same value. Similarly, clamping
the nodes in its guessed effect set won't affect the values of the second and third indices
to $i_1$'s $R$, since the values of those indices are set by $i_1$'s followers. So for example, if $i_2$
and its two followers go to night 0 in week 0, and $i_1$ and its two followers go to night 1 in
that week, then the associated guessed effect set wonderful life reward for $i_1$ for week 0
is

$$G(\zeta_{,t=0}) - G(\mathrm{CL}_{l_{i_1}(0), f_{1,i_2}(0), f_{2,i_2}(0)}(\zeta_{,t=0})) = R_{l_{i_1}(0), f_{1,i_1}(0), f_{2,i_1}(0)} + R_{l_{i_2}(0), f_{1,i_2}(0), f_{2,i_2}(0)} - \left[ R_{0, f_{1,i_1}(0), f_{2,i_1}(0)} + R_{l_{i_2}(0), 0, 0} \right].$$

This equals $R_{111} + R_{000} - R_{011} - R_{000} = 1 - R_{011}$. Simply
by setting $R_{011} > 1$ we can ensure that this is negative. Conversely, if leader $i_1$ had gone
to night 0, its guessed effect WLU would have been 0. So in this situation leader $i_1$ will
get a greater reward for going to night 0 than for going to night 1. In this situation,
leader $i_1$'s using its guessed effect set WLU will lead it to make the wrong pick.
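The two-night example can be checked numerically with the short sketch below. The value 2.0 chosen for $R_{011}$ and the zeros used for the entries of $R$ that the text leaves unspecified are assumptions for illustration, as are the function names.

```python
from itertools import product

# Reward matrix over (leader, follower 1, follower 2) nights, two nights only.
R = {triple: 0.0 for triple in product((0, 1), repeat=3)}   # unspecified entries: 0 (assumed)
R[(1, 1, 1)] = 1.0
R[(0, 1, 1)] = 2.0        # any R_011 > 1 works; 2.0 is a hypothetical choice


def wl_correct(pick):
    """WL reward when the guessed effect set is the leader plus its own two
    followers: clamping all three to 0 leaves R_000."""
    return R[(pick, pick, pick)] - R[(0, 0, 0)]


def wl_wrong(pick_1, pick_2):
    """WL reward for leader i1 when its guessed effect set is i1 plus the two
    followers of i2. Followers actually attend their own leader's night."""
    actual = R[(pick_1, pick_1, pick_1)] + R[(pick_2, pick_2, pick_2)]
    # Clamp i1 to 0 (its followers stay put) and i2's followers to 0.
    clamped = R[(0, pick_1, pick_1)] + R[(pick_2, 0, 0)]
    return actual - clamped


print(wl_correct(1), wl_correct(0))        # correct effect set favors night 1
print(wl_wrong(1, 0), wl_wrong(0, 0))      # wrong effect set favors night 0
```

With the wrong guessed effect set, night 1 earns $1 - R_{011} < 0$ while night 0 earns 0, so the leader is steered away from the pick that maximizes world utility.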

To investigate the efficacy of the macrolearning, two sets of separate experiments 
were conducted. In the first one the reward matrix R was chosen so that if each leader is 
maximizing its WL reward, but for guessed effect sets that contain none of its followers, 
then the system evolves to minimal world reward. So if a leader incorrectly guesses that 
some set is its effect set even though that set doesn't contain both of that leader's followers,
and if this is true for all leaders, then we are assured of worst possible performance. 
In the second set of experiments, we investigated the efficacy of macrolearning for a 
broader spectrum of reward matrices by generating those matrices randomly. We call 
these two kinds of reward matrices worst-case and random reward matrices, respectively. 
In both cases, if it can modify the initial guessed effect sets of the leaders to include their 
followers, then macrolearning will induce the system to be factored. 

The microlearning in these experiments was the same as in the bar problem. All 
experiments used the WL personal reward with some (initially random) guessed effect 
set. When macrolearning was used, it was implemented starting after the microlearning 
had run for a specified number of weeks. The macrolearner worked by estimating the 
correlations between the agents' selections of which nights to attend. It did this by exam- 
ining the attendances of the agents over the preceding weeks. Given those estimates, for 
each agent $\eta$ the two agents whose attendances were estimated to be the most correlated
with those of agent $\eta$ were put into agent $\eta$'s guessed effect set. Of course, none of this
macrolearning had any effect on global performance when applied to follower agents, but 
the macrolearning algorithm cannot know that ahead of time; it applied this procedure 
to each and every agent in the system. 
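A minimal sketch of such a macrolearner is given below. The text does not specify how the correlations were estimated, so this example assumes a simple stand-in: the fraction of observed weeks in which two agents attended the same night. The function name `guess_effect_sets` and the toy attendance history are likewise hypothetical.

```python
def guess_effect_sets(history, size=2):
    """history[t][a] is the night agent a actually attended in week t.

    For every agent, returns the agent itself plus the `size` other agents
    whose attendance agreed with it most often (an assumed correlation proxy).
    """
    n_agents = len(history[0])
    n_weeks = len(history)
    effect_sets = {}
    for a in range(n_agents):
        scores = []
        for b in range(n_agents):
            if b == a:
                continue
            same = sum(1 for week in history if week[a] == week[b])
            scores.append((same / n_weeks, b))
        # Highest agreement first; ties broken by agent index.
        scores.sort(key=lambda pair: (-pair[0], pair[1]))
        effect_sets[a] = {a} | {b for _, b in scores[:size]}
    return effect_sets


# Toy history: agents 0 and 3 are leaders; 1, 2 follow 0 and 4, 5 follow 3.
leader_a = [0, 1, 2, 3, 4, 5, 6, 0]
leader_b = [1, 1, 3, 5, 4, 0, 2, 6]
history = [[la, la, la, lb, lb, lb] for la, lb in zip(leader_a, leader_b)]
print(guess_effect_sets(history))
```

As in the text, the procedure is applied to every agent, followers included, because it has no way of knowing in advance which agents are leaders.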

Figure 4 presents averages over 50 runs of world reward as a function of weeks using
the worst-case reward matrix. For comparison purposes, in both plots the top curve 
represents the case where the followers are in their leader's guessed effect sets. The 
bottom curve in both plots represents the other extreme where no leader's guessed effect 
set contains either of its followers. In both plots, the middle curve is performance when 
the leaders' guessed effect sets are initially random, both with (right) and without (left) 
macrolearning turned on at week 500. 

Figure 4: Leader-follower problem with worst-case reward matrix. In both plots, every
follower is in its leader's guessed effect set in the top curve, no follower is in its leader's
guessed effect set in the bottom curve, and followers are randomly assigned to guessed
effect sets of the leaders in the middle curve. The two plots are without (left) and with
(right) macrolearning at 500 weeks.

The performance for random guessed effect sets differs only slightly from that of having
leaders' guessed effect sets contain none of their followers; both start with poor values
of world reward that deteriorate with time. However, when macrolearning is performed
on systems with initially random guessed effect sets, the system quickly rectifies itself
and converges to optimal performance. This is reflected by the sudden vertical jump
through the middle of the right plot at 500 weeks, the point at which macrolearning
changed the guessed effect sets. By changing those guessed effect sets macrolearning
results in a system that is factored for the associated WL reward function, so that those
reward functions quickly induced the maximal possible world reward.
Figure 5: Leader-follower problem for random reward matrices. The ordering of the
plots is exactly as in Figure 4. Macrolearning is applied at 2000 weeks, in the right plot.

Figure 5 presents performance averaged over 50 runs for world reward as a function of
weeks using a spectrum of reward matrices selected at random. The ordering of the plots
is exactly as in Figure 4. Macrolearning is applied at 2000 weeks, in the right plot. The
simulations in Figure 5 were lengthened from those in Figure 4 because the convergence
time of the full spectrum of reward matrices case was longer.

In Figure 5 the macrolearning resulted in a transient degradation in performance at
2000 weeks followed by convergence to the optimal. Without macrolearning the system's
performance no longer varied after 2000 weeks. Combined with the results presented in
Figure 4, these experiments demonstrate that macrolearning induces optimal performance
by aligning the agents' guessed effect sets with those agents that they actually do
influence the most.

5 CONCLUSION 

Many distributed computational tasks cannot be addressed by direct modeling of the 
underlying dynamics, or are at best poorly addressed that way due to robustness and 
scalability concerns. Such tasks should instead be addressed by model-independent ma- 
chine learning techniques. In particular, Reinforcement Learning (RL) techniques are 
often a natural choice for how to address such tasks. When — as is often the case — we 
cannot rely on centralized control and communication, such RL algorithms have to be 
deployed locally, throughout the system. 

This raises the important and profound question of how to configure those algorithms, 
and especially their associated utility functions, so as to achieve the (global) computa- 
tional task. In particular we must ensure that the RL algorithms do not "work at 
cross-purposes" as far as the global task is concerned, lest phenomena like tragedy of the 
commons occur. How to initialize a system to do this is a novel kind of inverse problem, 
and how to adapt a system at run-time to better achieve such a global task is a novel 
kind of learning problem. We call any distributed computational system analyzed from 
the perspective of such an inverse problem a Collective INtelligence (COIN). 

As discussed in the literature review section of this chapter, there are many ap- 
proaches/fields that address aspects of COINs. These range from multi-agent systems 
through conventional economics and on to computational economics. (Human economies 
are a canonical model of a functional COIN.) They range onward to game theory, various 
aspects of distributed biological systems, and on through physics, active walker models, 
and recurrent neural nets. Unfortunately, none of these fields seems appropriate as a 
general approach to understanding COINs. 

After this literature review we present a mathematical theory for COINs. We then 
present experiments on two test problems that validate the predictions of that theory 
for how best to design a COIN to achieve a global computational task. The first set of 
experiments involves a variant of Arthur's famous El Farol Bar problem. The second 
set instead considers a leader-follower problem that is hand-designed to cause maximal 
difficulty for the advice of our theory on how to initialize a COIN. This second set of
experiments is therefore a test of the on-line learning aspect of our approach to COINs. 
In both experiments the procedures derived from our theory, procedures using only local 
information, vastly outperformed natural alternative approaches, even such approaches 
that exploited global information. Indeed, in both problems, following the theory sum- 
marized in this chapter provides good solutions even when the exact conditions required 
by the associated theorems hold only approximately. 

There are many directions in which future work on COINs will proceed; it is a vast and 
rich area of research. We are already successfully applying our current understanding of 
COINs, tentative as it is, to internet packet routing problems. We are also investigating 
COINs in a more general optimization context where economics-inspired market mech- 
anisms are used to guide some of the interactions among the agents of the distributed 
system. The goal in this second body of work is to parallelize and solve numerical opti- 
mization problems where the concept of an "agent" may not be in the natural definition 
of the problem. We also intend to try to apply our current COIN framework to the prob- 
lem of designing high-occupancy toll lanes in vehicular traffic, and to help understand 
the "design space" necessary for distributed biochemical entities like pre-genomic cells. 

Acknowledgements: The authors would like to thank Ann Bell, Michael New, Peter 
Norvig and Joe Sill for their comments. 